1 Introduction

Any computable Boolean function f(x) can be represented as a Boolean circuit \(C_{u, v}^g(x)\) with u input wires \(x = (\text {in}_1, \ldots , \text {in}_u)\), v output wires \(\text {out}_1, \ldots , \text {out}_v\), and \(g\) gates for some u, v, and g. The size of such a Boolean circuit is \(n=u+v+g\). Universal circuits (UCs) are programmable circuits that can simulate any Boolean function f(x) up to a given size n. To program a UC to compute f, programming or control bits are specified as further inputs \(c^f=\{c_1, \ldots , c_m\}\). The UC then receives these control bits as inputs along with the input x and computes the result as \(UC(x, c^f) = f(x)\). This means that the same UC can evaluate different Boolean circuits by specifying the respective control bits. In analogy to a universal Turing machine, a universal circuit allows turning any function into data in the form of a program description.

Several efficient constructions considering both the size and the depth of UCs were proposed. Valiant proposed in [66] an asymptotically size-optimal UC construction with size \(\Theta (n\log n)\) and depth \({\mathcal {O}}(n)\) [68]. He presents two constructions, called 2-way and 4-way UCs, based on so-called edge-universal graphs (EUGs) that utilize either 2 or 4 subcircuits, respectively. The asymptotic complexity of the 4-way UC is \({\sim }\,4.75 n\log _2 n\) which is smaller than that of the 2-way UC of \({\sim }\,5 n \log _2 n\) [66]. The 4-way UC has been further improved in [72], where its size is reduced to \({\sim }\,4.5n\log _2n\). An asymptotically depth-optimal construction with depth \(\Theta (d)\) that simulates circuits with depth d was proposed in [17], but it has a significantly larger size of \({\mathcal {O}}(n^3d/\log n)\). In our paper, due to the applications in cryptography that we revisit in Sect. 1.1, we concentrate on the existing size-optimized UCs, especially that proposed by Valiant [66] with asymptotic size \(\mathrm {\Theta }(n \log n)\) with the optimization presented by Zhao et al. in [72].

1.1 Applications of Universal Circuits

Size-optimized universal circuits have many applications. We review them here and refer to the original publications for more detailed descriptions.

1.1.1 Private Function Evaluation (PFE)

The most prominent application of universal circuits is the secure evaluation of private functions based on secure function evaluation (SFE) or secure computation. SFE enables two parties \(P_1\) and \(P_2\) to evaluate a publicly known function f(x, y) on their respective private inputs x and y, ensuring that none of the participants learns anything about the other participant’s input apart from the output of the computation. Many secure computation protocols, such as Yao’s garbled circuit protocol [47, 69, 70] and the GMW protocol [32], use Boolean circuits for representing the desired functionality. In some applications, the function itself should be kept private. This setting is called private function evaluation (PFE), where we assume that only one of the parties \(P_1\) knows the function f(x), whereas the other party \(P_2\) provides the input x to the private function. \(P_2\) should learn no information about f except for an upper bound on the size of the circuit describing the function, and \(P_1\) should learn nothing about x beyond what can be inferred from the result f(x).

PFE can be reduced to SFE [1, 44, 58, 63] by securely evaluating a UC that is programmed by \(P_1\) to evaluate the function f on \(P_2\)’s input x. For this, \(P_1\) provides the control bits \(c^f\) for the UC and \(P_2\) provides his private input x into an SFE protocol that computes \(UC(x, c^f)\). Here, the UC is a public function and the control bits \(c^f\)—and therefore the function f—and input x are kept private due to the properties of SFE. The first implementation of PFE was provided in [44, 61], which extends the Fairplay secure computation framework [51] with universal circuits. The underlying UC construction achieves a non-optimal asymptotic size of \({\mathcal {O}}(n\log ^2 n)\) and depth \({\mathcal {O}}(n\log n)\). We have shown in [45] that it results in larger UCs than Valiant’s constructions for all reasonable circuit sizes in practice. The complexity of PFE in this case is determined mainly by the size and depth of the UC, while the security follows from that of the SFE protocol that is used to evaluate the UC. If the SFE protocol is secure against semi-honest, covert, or malicious adversaries, then the PFE protocol is secure in the same adversarial setting. UC-based PFE can be easily integrated into any SFE framework and can directly benefit from recent optimizations. For instance, outsourcing UC-based PFE to two or multiple servers using XOR secret sharing is directly possible with outsourced SFE [42]. The non-interactive secure computation protocol of [3] can be generalized to obtain a non-interactive PFE protocol [46]. Moreover, with UC-based PFE, evaluating public and private parts of a functionality can easily be performed together without modifying the underlying secure computation framework.

In [40], Katz and Malka presented an alternative approach for PFE that does not rely on UCs. They use additively homomorphic public-key encryption as well as a symmetric-key encryption scheme and achieve constant-round PFE with linear \({\mathcal {O}}(n)\) communication complexity. However, the number of public-key operations is linear in the circuit size, and due to the gap between the efficiency of public-key and symmetric-key operations, this results in a less efficient protocol. Their protocol is secure against semi-honest adversaries, uses Yao’s garbled circuits [70], and has recently been improved in [5], where the authors modify the algorithm to perform one full execution from which information can be reused in subsequent more efficient executions of the protocol. Mohassel and Sadeghian consider PFE with semi-honest adversaries in [53] and propose a generic PFE framework that can be instantiated with different secure computation protocols. Their first protocol uses homomorphic encryption with which they achieve linear complexity \({\mathcal {O}}(n)\) in the circuit size n and their second protocol relies solely on oblivious transfers (OT), which results in a method with \({\mathcal {O}}(n\log n)\) symmetric-key operations. The OT-based construction from [53] or PFE using UCs is more desirable than the linear homomorphic encryption-based methods in practice, since using OT extension, the number of expensive public-key operations can significantly be reduced, such that it is independent of the number of OTs [2, 36]. Biçer et al. [6] improve the communication of the OT-based PFE protocol of [53] by around \(40\%\). The asymptotic complexity of the OT-based construction of [53] and Valiant’s UCs for PFE is the same, and therefore, we compare these solutions for PFE in more detail in Sect. 8. Mohassel et al. extend the framework from [53] to malicious adversaries in [54] with linear complexity \({\mathcal {O}}(n)\), using additively homomorphic encryption. 
Active security of UC-based PFE is achieved by using a secure computation protocol with active security. Despite their claimed better efficiency, to the best of our knowledge, these protocols have not yet been implemented and are not as generally applicable as PFE with UCs, e.g., they cannot be easily combined with secure evaluation of public functions.

Semi-private function evaluation (semi-PFE) has been proposed in [60] and allows for PFE where the function f is in a set of functions \({\mathcal {F}}\) known by both parties. This relaxes the necessary topology hiding requirement of generic PFE. Yao’s garbled circuit can be used for evaluating circuits of the same topology as shown in [59]. Recently, an automated approach for semi-PFE has been proposed in [39], where the circuits representing \(f\in {\mathcal {F}}\) have varying topologies, for which a container topology is found that can be programmed to compute any of the available topologies. This has therefore been defined as a set-universal circuit, i.e., a circuit that can be programmed to compute any circuit from a pre-defined set of circuits. This approach has been further improved in [41], where a modified garbled circuit protocol allows for efficient semi-PFE with linear communication in the size of the largest circuit in \({\mathcal {F}}\). However, semi-PFE does not suffice for generic PFE where we have an exponential number of possible circuit topologies.

1.1.2 Applications of PFE

PFE can be applied in scenarios where one of the parties wants to keep the evaluated function private. One of the first applications for PFE was privacy-preserving checking for credit worthiness [21], where not only the loanee’s data, but also the loaner’s function that computes if the loanee is eligible for a credit needs to be kept private. The original scheme, using garbled circuits, can represent simple policies, but by evaluating a UC their scheme can be extended to more complicated credit checking policies. [15] shows an application for secure computation, where evaluating UCs or other PFE protocols would ensure privacy: When autonomous mobile agents migrate between several distrusting hosts, the privacy of the inputs of the hosts is achieved using SFE, while privacy of the mobile agent’s code can be guaranteed with PFE. [57] shows a method to filter remote streaming data obliviously, using secret keywords and their combinations. Their scheme can additionally preserve data privacy by using PFE to search the matching data with a private search function. PFE allows for running proprietary software on private data, such as privacy-preserving evaluation of diagnostic programs that was considered in [13], where the owner of the program does not want to reveal the diagnostic method and the user does not want to reveal his data. Example applications for such programs include medical diagnostics [9] and remote software fault diagnosis, where the function and the user’s input are desired to be handled privately. In the protocol presented in [13], the diagnostic programs are represented as binary decision trees or branching programs which can easily be converted into a Boolean circuit representation and evaluated using PFE based on universal circuits. Moreover, PFE can be applied to create blinded policy evaluation protocols [20, 24]. 
[20] utilizes UCs for so-called oblivious circuit policies and [18] for hiding the circuit topology in order to create one-time programs. In [25, 59], universal circuits are used for hiding queries in private database management systems (DBMSs). The Blind Seer DBMS [25] was improved in [59] by making use of a simpler UC for evaluating queries, which does not hide the circuit topology. The authors mention that in case the topology of the SQL formula and the circuit have to be kept private, a generic UC should be utilized. Further applications of PFE given in [53] are evaluation of branching programs on encrypted data [37] and privacy-preserving intrusion detection [56].

1.1.3 UC Applications Beyond PFE

Apart from being used for PFE, UCs can be applied in various other scenarios. Efficient verifiable computation on encrypted data was studied in [22]. A verifiable computation scheme was proposed for arbitrary computations, and a UC is required to hide the function. [29] make use of UCs for reducing the verifier’s preprocessing step. In [30], a DDH-based multi-hop homomorphic encryption scheme is proposed that uses re-randomizable garbled circuits, for which UCs are used to achieve function privacy. When the common reference string is dependent on a function that the verifier is interested in outsourcing, then the function description can be provided as input to a UC of appropriate size. As described in [4], the Attribute-based encryption (ABE) schemes [27, 34] for any polynomial-size circuits can be turned into ciphertext-policy ABE by using UCs. The ABE scheme of [28] also uses UCs. Universal circuits can be applied for program obfuscation. Candidates for indistinguishability obfuscation are constructed using a UC as a building block in [14, 26]. The algorithm of [26] has been implemented in [12], which can be improved using Valiant’s UC implementation [45]. Direct program obfuscation was proposed in [71], where the circuit is a secret key to a UC. [46] mentions that UCs can be applied for secure two-party computation in the batch execution setting, where the cost of evaluating Yao’s garbled circuits is amortized if the same circuit—a UC—is evaluated [35, 49]. This protocol has been made round-optimal in [52].

1.1.4 Implied Theoretical Results

We mention two theoretical results relying on UCs. Both the depth-optimized UC from [17] and Valiant’s size-optimized UCs were adapted in [8] to construct universal quantum circuits. The design of universal parallel computers was inspired by Valiant’s UCs as well [33, 50].

1.2 Our Contributions and Outline

In Sect. 2, we recapitulate the necessary preliminaries for our work. We revisit the asymptotically size-optimal UCs of [66] in Sect. 3. This complex construction makes use of an internal graph representation and programs a so-called edge-universal graph (Sect. 3.1). Thereafter, we describe how an edge-universal graph can be translated into a universal circuit (Sect. 3.2). Finally, we revisit Valiant’s 2-way (Sect. 3.3) and 4-way UCs (Sect. 3.4) and the improved building block proposed by Zhao et al. [72] for the latter.

Our modular programming algorithm (Sect. 4). We detail our modular algorithm for programming a universal circuit that provides the description of the input function f as program bits \(c^f\) to the UC, for both Valiant’s 2-way and 4-way UCs. Our method consists of two steps, the block edge-embedding (Sect. 4.1) and the recursion point edge-embedding (Sect. 4.2).

New universal circuit constructions and extensions (Sect. 5). We describe Lipmaa et al.’s generalization [46] of Valiant’s universal circuit to any k-way UC (Sect. 5.1) and detail how our modular programming algorithm from Sect. 4 can be directly generalized for this extension. We continue with presenting a new 3-way UC (Sect. 5.2) that is predicted to be more efficient than the existing UCs. However, after providing modular building blocks for this UC, we show that it is asymptotically larger than Valiant’s UCs, due to an optimization that cannot be applied for one of its building blocks. Then, we propose a hybrid UC construction (Sect. 5.3) that can efficiently combine k-way UCs for multiple values of k. With this, we combine Valiant’s 2-way and 4-way UCs to achieve the smallest universal circuit known so far. Lastly, we provide our scalable algorithms (Sect. 5.4) that allow for generating and programming UCs with only linear \({\mathcal {O}}(n)\) memory instead of handling the whole structure of size \({\mathcal {O}}(n\log n)\) in memory at once.

Optimized size and depth of UCs (Sect. 6). We compare the asymptotic (Sect. 6.1) and concrete (Sect. 6.2) sizes of Valiant’s (2-way and 4-way) UCs and that of different k-way UCs. We show that of all k-way UCs of Lipmaa et al. [46], Valiant’s 4-way UC provides the smallest size for large circuits, whereas Valiant’s 2-way UC provides the smallest depth. We include size optimizations, achieving a linear concrete improvement for all UCs. Moreover, we show that our 2/4 hybrid method for generating UCs improves over the 4-way UCs, i.e., both over Valiant’s 4-way UC and over the optimized 4-way UC of [72].

Implementation of Valiant’s UCs and experiments (Sect. 7). We detail the steps of our algorithm for a practical realization of Valiant’s UC construction and implement the 2-way and recently optimized 4-way UCs as well as our 2/4 hybrid UC construction. We note that our implementation is the first implementation that includes the optimization of Zhao et al. [72], which achieves the best size \({\sim }\,4.5n\log _2 n\) to date. We describe the architecture of our UC compiler (Sect. 7.1). We experimentally evaluate the performance of our UC generation and programming algorithms with a set of example circuits (Sect. 7.2). We provide the evaluation of our scalable 4-way UC as well and compare it with our memory-based implementation of Valiant’s 4-way UC.

Toolchain for private function evaluation using universal circuits (Sect. 8). We provide the implementation of an example application for universal circuits, namely of private function evaluation (PFE) by extending the ABY secure function evaluation framework [19] to evaluate our universal circuits (Sect. 8.1). We provide the first implementation for PFE with \({\mathcal {O}}(n\log n)\) complexity and show experimental results for performing PFE (Sect. 8.2). We theoretically compare PFE with UCs with other state-of-the-art approaches for PFE (Sect. 8.3).

1.3 Additions to Conference Versions

This journal article is a significantly extended and improved version of the conference publications [45] and [31]. Our added contributions are as follows.

  1.

    Optimizations. We included the optimized building block of [72] in our 4-way and hybrid implementations as well as in the size and depth comparisons. This allows us to compare all state-of-the-art methods for UCs. This is the first implementation of their construction, which has the lowest asymptotic and concrete sizes known so far.

  2.

    Scalability. We extend our design and implementation with a scalable 4-way UC construction based on Valiant’s 4-way UC, which reduces the memory complexity from \({\mathcal {O}}(n\log n)\) to \({\mathcal {O}}(n)\) when generating and programming the universal circuit. This construction involves a novel layer-by-layer approach for generating and topologically ordering the universal circuit and programs the structure according to the recursion steps, i.e., subcircuit by subcircuit.

  3.

    Universal circuit depths. We examine the depth of the universal circuits in addition to their sizes: although the constructions are optimized for size, some applications also require minimizing the depth. For instance, the number of communication rounds in PFE via secure function evaluation with the GMW protocol [32] (which, in contrast to Yao’s garbled circuits, allows precomputing all symmetric cryptographic operations [64]) depends on the depth of the universal circuit.

  4.

    Comparison and implementation. In our previous works, we have compared the 2-way and 4-way UCs with each other and with the only other existing UC of [44]. In this work, we implement the hybrid method that uses both 2-way and 4-way UCs and achieves the best concrete size for all simulated circuit sizes. We also implement our new scalable 4-way UC construction, which utilizes very different algorithms than those applied before for UC generation. We compare these methods with respect to runtime, communication, and memory consumption.

2 Preliminaries

As preliminaries for our paper, we introduce the graph and circuit theoretic background in Sect. 2.1 and Sect. 2.2, respectively. We provide a summary of all our notations and abbreviations in “Appendix A.”

2.1 Graph Theory

In this section, we describe the graph theoretic preliminaries necessary for our work.

Definition 1

The number of incoming [outgoing] edges of a node is called its indegree [outdegree]. A graph has fanin [fanout] \(\rho \) if the indegree [outdegree] of all its nodes is at most \(\rho \).

We denote by \(\Gamma _\rho (n)\) the set of all directed acyclic graphs with n nodes and fanin and fanout \(\rho \).

Definition 2

Let \(G=(V, E)\) be a directed graph with set of nodes \(V=\{1, \ldots , n\}\) and edges \(E \subseteq V \times V\). A mapping \(\eta ^G:V \rightarrow \{1, \ldots , n\}\) is called topological order if \((i, j) \in E\) implies that \(\eta ^G(i) < \eta ^G(j)\) and \(\forall i, j \in V: \eta ^G(i) = \eta ^G(j)\) means that \(i = j\). In short, \(\eta ^G(i) > \eta ^G(j)\) implies that there is no edge or directed path from i to j.

A topological order of \(G\in \Gamma _\rho (n)\) can be found with computational complexity \({\mathcal {O}}(\rho n)\). Further on, we require a labeling of the nodes in a topological order.
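As an illustration of the stated complexity, the following is a minimal sketch of one standard way to obtain such an order (Kahn's algorithm); it is not necessarily the procedure used in our implementation. The function name `topological_order` and the edge-list input format are assumptions made for this example. Since a graph in \(\Gamma _\rho (n)\) has at most \(\rho n\) edges, the running time is \({\mathcal {O}}(\rho n)\).

```python
from collections import deque


def topological_order(n, edges):
    """Return eta as a dict mapping each node in 1..n to its position in 1..n.

    edges is a list of pairs (i, j) describing directed edges i -> j.
    """
    successors = {v: [] for v in range(1, n + 1)}
    indegree = {v: 0 for v in range(1, n + 1)}
    for i, j in edges:
        successors[i].append(j)
        indegree[j] += 1

    # Nodes without unprocessed predecessors can be numbered next.
    ready = deque(v for v in range(1, n + 1) if indegree[v] == 0)
    eta = {}
    while ready:
        v = ready.popleft()
        eta[v] = len(eta) + 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)

    if len(eta) != n:
        raise ValueError("graph contains a cycle")
    return eta
```

Relabeling every node a as \(\eta ^G(a)\) then yields the labeling in topological order that is required further on.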

Definition 3

Edge-embedding is a mapping from graph \(G=(V, E)\) into \(G^\prime =(V^\prime , E^\prime )\) that maps V into \(V'\) one-to-one, with possible additional nodes in \(V'\), i.e., \(V \subseteq V^\prime \) and E into directed paths in \(E'\), such that all paths are pairwise edge-disjoint, i.e., an edge can be used only in one path.
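To make the definition concrete, the sketch below checks whether a candidate set of paths forms an edge-embedding. It assumes, as in the definition, that the nodes of G appear under the same names in \(G^\prime \); the function name `is_edge_embedding` and the dictionary-based path format are hypothetical choices for this example.

```python
def is_edge_embedding(edges, host_edges, paths):
    """Check Definition 3 for a candidate embedding.

    edges:      edges (i, j) of G
    host_edges: edges of G'
    paths:      dict mapping each edge (i, j) of G to a node list [i, ..., j] in G'
    """
    host = set(host_edges)
    used = set()
    for i, j in edges:
        path = paths.get((i, j))
        # Every edge of G needs a path that starts at i and ends at j.
        if not path or len(path) < 2 or path[0] != i or path[-1] != j:
            return False
        for step in zip(path, path[1:]):
            # Each step must be an edge of G' not used by any other path.
            if step not in host or step in used:
                return False
            used.add(step)
    return True
```

For instance, two paths that both leave node 1 through the same edge of \(G^\prime \) are rejected, whereas routing them through two distinct additional nodes is accepted.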

Theorem 1

(Kőnig–Hall theorem) Given a directed acyclic graph (DAG) \(G \in \Gamma _2(n)\), the set of edges E can be separated into two disjoint sets \(E_1\) and \(E_2\), such that graphs \(G_1=(V, E_1)\) and \(G_2=(V, E_2)\) are instances of \(\Gamma _1(n)\), having fanin and fanout 1 for each node [38, 48, 66].

Proof of Theorem 1

Given the set of nodes in topological order \(V = \{1, \ldots , n\}\), we can construct a bipartite graph \({\overline{G}}=({\overline{V}}, {\overline{E}})\) with nodes \({\overline{V}}=\{m_1, \ldots , m_n, m'_1, \ldots , m'_n\}\) and edges \({\overline{E}}\) such that \((m_i, m'_j)\in {\overline{E}}\) if and only if \((i, j)\in E\). It is easy to see that the fanin and fanout of the resulting bipartite graph is also 2. The edges of \({\overline{G}}\) and thus the corresponding edges of G can be colored in a way that the result is a valid two-coloring. Having fanin and fanout of at most 2, such coloring can be found directly with the following method:

figure a

This edge-coloring can be performed in \({\mathcal {O}}(n)\) steps and it defines the edges in \(E_1\) and \(E_2\), such that \(E_1\) contains the edges colored with color one and \(E_2\) the ones with color two and \(G_1=(V, E_1)\) and \(G_2=(V, E_2)\). \(\square \)

The Kőnig–Hall theorem was used in [45, 46] to provide a 2-coloring algorithm for the edges of a graph with fanin and fanout 2. In its originally proposed form, however, Kőnig’s theorem [38, 48] applies also for k-coloring the edges of any graph with at most k incoming and outgoing edges for each of its nodes. This transformation can be easily generalized to graphs in \(\Gamma _k(n)\), in which case the resulting bipartite graph will have fanin and fanout k. We review this theorem and the corresponding algorithm here.

Theorem 2

(Kőnig’s theorem) If \({\overline{G}}\) is bipartite and its nodes have at most k incoming and outgoing edges, then the number of colors sufficient to color all edges of \({\overline{G}}\) is k.

Proof of Theorem 2

([38, 48]) Take colors \(\{1, \ldots , k\}\), and greedily color edges. Let us assume that at some point the coloring stops because we cannot color more edges. In this step, \((w_i, z_j)\) is an uncolored edge. If we look at the colors of the edges adjacent to \(w_i\) and \(z_j\), we can define the set of available colors for both nodes. There is at least one color for both \(w_i\) and \(z_j\) due to the fanin and fanout restriction, but there is no color which is available for both nodes, otherwise we could color \((w_i, z_j)\).

There is a color that is used in an edge adjacent to \(w_i\), e.g., color a, but not on an edge adjacent to \(z_j\). In the same way, we can find another color b that is used in an edge adjacent to \(z_j\), but not to \(w_i\). Take the longest unique path P from \(w_i\) that uses colors a and b alternatingly.

For the sake of contradiction, assume that this path also contains \(z_j\). It then terminates in \(z_j\) due to the fact that \(z_j\) is not incident to an edge colored with a. Then, \(P\cup (w_i, z_j)\) is an odd cycle, which is impossible since \({\overline{G}}\) is bipartite. Therefore, P does not contain \(z_j\), and we can exchange colors a and b on path P and color \((w_i, z_j)\) with color a.

This process is continued until there are no uncolored edges in \({\overline{G}}\). \(\square \)
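The argument of the proof can be sketched as the following procedure. This is an illustrative sketch rather than our actual implementation: the name `konig_edge_coloring` and the two dictionaries `left` and `right` are assumptions of this example, and the bipartite graph is assumed to be simple (no parallel edges), with an edge \((w, z)\) leading from a left node w to a right node z.

```python
def konig_edge_coloring(edges, k):
    """Color the edges (w, z) of a simple bipartite graph with colors 1..k.

    Every left node w and every right node z may occur in at most k edges.
    Returns a dict mapping each edge to its color.
    """
    left, right = {}, {}  # node -> {color: neighbour on the other side}
    for w, z in edges:
        left.setdefault(w, {})
        right.setdefault(z, {})

    def free(used):
        return [c for c in range(1, k + 1) if c not in used]

    for w, z in edges:
        free_w, free_z = free(left[w]), free(right[z])
        if not free_w or not free_z:
            raise ValueError("a node has more than k incoming or outgoing edges")
        common = [c for c in free_w if c in free_z]
        if common:
            color = common[0]  # greedy step
        else:
            a, b = free_z[0], free_w[0]  # a is used at w, b is used at z
            # Collect the path from w whose edge colors alternate a, b, a, ...
            path, node, on_left, c = [], w, True, a
            while c in (left if on_left else right)[node]:
                nxt = (left if on_left else right)[node][c]
                path.append((node, nxt, c) if on_left else (nxt, node, c))
                node, on_left, c = nxt, not on_left, (b if c == a else a)
            # Exchange a and b along the path: remove all edges, then re-insert.
            for u, v, c in path:
                del left[u][c], right[v][c]
            for u, v, c in path:
                swapped = b if c == a else a
                left[u][swapped], right[v][swapped] = v, u
            color = a  # a is now available at both w and z
        left[w][color], right[z][color] = z, w

    return {(w, z): c for w, used in left.items() for c, z in used.items()}
```

For \(k=2\), the edges colored 1 and 2 give the sets \(E_1\) and \(E_2\) of Theorem 1. Note that this simple sketch recomputes the free colors for every edge and may walk a long alternating path, so it does not by itself achieve the \({\mathcal {O}}(n)\) bound stated for the two-coloring.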

2.2 Circuit Theory

Definition 4

The fanin [fanout] of a circuit can be defined analogously to the fanin [fanout] of a graph (cf. Definition 1), i.e., the maximum number of incoming [outgoing] wires of all its gates, inputs and outputs.

Theorem 3

A circuit \(C_{u, v}^{{\hat{g}}}\) with u inputs, \({\hat{g}}\) gates, and v outputs and fanin and fanout \(\rho > 2\) can be transformed to a circuit \(C_{u, v}^g\) with fanin and fanout 2.

Proof of Theorem 3

Shannon’s expansion theorem [61, 62] describes how gates with larger fanin can be reduced to gates with two inputs by adding additional gates, which results in a circuit \(C_{u, v}^{{\tilde{g}}}\) with \({\tilde{g}}\) fanin 2 gates. It was proven in [66] that the general case, where the fanout of the circuit can be any integer \(\rho \ge 2\), can be transformed to the special case when \(\rho \le 2\) by introducing copy gates, each of which eliminates one from the extra fanout of the original gate. We place a binary tree in place of each gate with fanout larger than 2, following Valiant’s proposition: “Any gate with fanout \(x+2\) can be replaced by a binary fanout tree with \(x + 1\) gates” [66, Corollary 3.1]. Thus, the class of Boolean functions with u inputs and v outputs that can be realized by acyclic circuits with \({\tilde{g}}\) gates and arbitrary fanout can also be realized with an acyclic fanout-2 circuit with \({\tilde{g}}\le g\le 2{\tilde{g}}+ v\) gates.
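The fanout reduction can be illustrated with the following sketch, which replaces the outgoing wires of one gate by a fanout tree. For simplicity, the tree is built as a chain of copy gates, which is only one possible shape (a balanced tree would have smaller depth); the function name `fanout_tree` and the naming scheme of the copy gates are hypothetical.

```python
def fanout_tree(gate, consumers):
    """Return wires (source, destination) such that every source drives at most two wires.

    gate:      name of the original gate
    consumers: names of the nodes that read the output of the gate
    """
    wires = []
    source = gate
    remaining = list(consumers)
    copies = 0
    while len(remaining) > 2:
        # The current source serves one consumer and hands the rest to a new copy gate.
        copy = f"{gate}_copy{copies}"
        copies += 1
        wires.append((source, remaining.pop(0)))
        wires.append((source, copy))
        source = copy
    wires.extend((source, consumer) for consumer in remaining)
    return wires
```

For a gate with fanout \(x+2\), the sketch introduces x copy gates, so that the tree consists of \(x+1\) gates including the original one, in line with the quoted corollary.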

Definition 5

We can regard \(C_{u, v}^g\) with u inputs, v outputs, and \(g\) gates as a \(\Gamma _2(n)\) graph G—which we commonly refer to as the graph of circuit \(C_{u, v}^g\)—with \(n=u + v+ g\) by creating a node for each input, gate, and output, and an edge for each wire in \(C_{u, v}^g\).

3 Valiant’s Universal Circuit Constructions

In any circuit \(C_{u, v}^{{\hat{g}}}\), the inputs of each of the \({\hat{g}}\) gates are either connected to one of the u inputs, to the output of a previous gate, or are assigned a fixed constant. Due to the nature of Valiant’s edge-universal graph (EUG) construction, the input circuit must have fanin and fanout 2, which can be achieved with the transformations described in Sect. 2.2 and implemented in [44, 45]. From here on, and without loss of generality, we assume that our input circuit \(C_{u, v}^g\) has u inputs, \(g\) gates and v outputs and fanin and fanout 2.

The size of a function f represented by a circuit \(C_{u, v}^g\) with fanin and fanout 2 is \(n = u + v + g\), which can be represented as a graph \(G\in \Gamma _{2}(n)\). In this section, we describe Valiant’s UC constructions [66, 68] that can be programmed to evaluate any function of size n. We explain the general idea behind Valiant’s UC construction [66] in Sects. 3.1 and 3.2, and the 2-way and 4-way UCs along with improvements of [31, 45, 46, 72] in Sects. 3.3 and 3.4, respectively.

3.1 Valiant’s Edge-Universal Graph Construction

Valiant’s UC construction relies on the notion of so-called edge-universal graphs that are then translated to universal circuits [66].

Definition 6

A graph \(U_n(\Gamma _{\rho })=(V_U, E_U)\) is an edge-universal graph (EUG) for \(\Gamma _{\rho }(n)\) if every graph \(G=(V, E)\) in \(\Gamma _{\rho }(n)\) can be edge-embedded (cf. Definition 3) into \(U_n(\Gamma _{\rho })\).

An EUG \(U_n(\Gamma _{\rho })\) has distinguished nodes called poles \(P=\{p_1, \ldots , p_n\} \subseteq V_U\) where each node \(a \in V = \{1, \ldots , n\}\) is mapped to exactly one pole with an injective mapping \(\varphi ^V: V \rightarrow V_U\). This mapping is defined by a concrete topological order \(\eta ^G\) of the original graph G with \(\varphi ^V(a) = p_{\eta ^G(a)}\), i.e., every node in G has a corresponding pole in \(U_n(\Gamma _{\rho })\). Apart from the poles, \(U_n(\Gamma _{\rho })\) might have additional nodes that enable the edge-embedding (cf. Sect. 2.1). For each edge \((a_i, a_j) \in E\), we then define a path of variable length z between the corresponding poles \(\varphi ^V(a_i)= p_{\eta ^G(a_i)} = b_1\) and \(\varphi ^V(a_j) = p_{\eta ^G(a_j)} = b_z\) as \((b_1, \ldots , b_z)\), where \(b_1, \ldots , b_z \in V_U\). All these paths are edge-disjoint, i.e., they do not use any edge in \(U_n(\Gamma _{\rho })\) in more than one path (cf. Sect. 2.1).

Let \(U_n(\Gamma _{1})\) be an EUG for graphs in \(\Gamma _{1}(n)\) with n poles \(P = \{p_1, \ldots , p_n\}\) (we will show concrete constructions for such EUGs in Sect. 3.3 and in Sect. 3.4). The nodes of any topologically ordered \(\Gamma _{1}(n)\) graph can be mapped to these poles. The poles have fanin and fanout 1, while all other nodes have fanin and fanout 2.

An EUG \(U_n(\Gamma _{\rho })\) for \(\rho \ge 2\) is created by taking \(\rho \) instances of \(U_n(\Gamma _{1})\) EUGs with poles \(P_1 = \{p_{1, 1}, \ldots , p_{1, n}\}, \ldots , P_\rho = \{p_{\rho , 1}, \ldots , p_{\rho , n}\}\), and merging each pole with its multiple instances, i.e., the set of merged poles \(P=\{p_1, \ldots , p_n\}\) is formed by merging \(p_{1, i}, \ldots , p_{\rho , i}\) to obtain \(p_i\) for \(i=1, \ldots , n\). All edges are preserved, and thus, the poles have fanin and fanout \(\rho \), i.e., \(U_n(\Gamma _{\rho }) = (V_U^{\prime }, E_U^{\prime })\) is an EUG with fanin and fanout \(\rho \), constructed with \(U_n(\Gamma _{1})_1 = (V_1, E_1), \ldots , U_n(\Gamma _{1})_{\rho } = (V_{\rho }, E_{\rho })\). P contains the merged poles and \(V_U^{\prime } = P \cup \bigcup _{i=1}^{\rho } (V_i \backslash P_i)\) and \(E_U^{\prime } = \bigcup _{i=1}^{\rho } E_i\). Thus, the poles in \(U_n(\Gamma _{\rho })\) have at most \(\rho \) inputs and outputs, and all other nodes have at most two inputs and outputs.

Example. Let C be the circuit shown in Fig. 1a, and \(G=(V, E)\) be the graph of circuit C with 5 nodes shown in Fig. 1b. Our aim is to edge-embed G into EUG \(U_5(\Gamma _2)\). Therefore, we use two instances of \(U_5(\Gamma _1)\): \(U_5(\Gamma _1)_1\) in Fig. 1c and \(U_5(\Gamma _1)_2\) in Fig. 1d. The edges \({(a_1, a_4)}, {(a_2, a_3)}\) and \({(a_4, a_5)}\) are embedded in \(U_5(\Gamma _1)_1\), and the edges \({(a_1, a_3)}\) and \({(a_3, a_4)}\) in \(U_5(\Gamma _1)_2\). Merging the poles of \(U_5(\Gamma _1)_1\) and \(U_5(\Gamma _1)_2\) produces \(U_5(\Gamma _2)\) shown in Fig. 1e. In Sect. 3.2, we describe how to retrieve the resulting universal circuit depicted in Fig. 1f.

Recursion Base. Valiant’s construction is recursive, and the recursion base graphs for up to 6 nodes are shown in [66, Fig. 3] and [45, Fig. 1]. \(U_1(\Gamma _1)\) is a single pole and \(U_2(\Gamma _1)\) and \(U_3(\Gamma _1)\) are two and three connected poles, respectively. Valiant provides hand-optimized EUGs for \(U_4(\Gamma _1)\), \(U_5(\Gamma _1)\) and \(U_6(\Gamma _1)\), with 3, 7, and 9 additional nodes, respectively (cf. [66, Fig. 3]).

Fig. 1 (a) An example circuit and (b) the corresponding \(\Gamma _2(5)\) graph G. (c, d) The edge-embedding of G into two \(U_5(\Gamma _1)\) instances with poles \((p_1, \dots , p_5)\). (e) The edge-embedding of G into the \(U_5(\Gamma _2)\) graph of the universal circuit shown in (f)

3.2 Translating Edge-Universal Graphs into Universal Circuits

In this section, we define universal circuits (UCs) and describe how an edge-universal graph is translated into a universal circuit.

Definition 7

A universal circuit UC is a Boolean circuit that can be programmed to compute any circuit \(C_{u, v}^g\) up to a given size n by defining a set of programming bits \(c^f\) such that \(UC(x, c^f) = C_{u, v}^g(x)\).

In Valiant’s UC constructions, every node \(w \in V_U\) fulfills a task when \(U_n(\Gamma _{2})\) is translated to a UC. Programming the UC means specifying its control bits along the paths defined by the edge-embedding and by the gates of circuit \(C_{u, v}^g\). Depending on the number of incoming and outgoing edges and its type, a node w is translated as described below and shown in the example in Fig. 1f.

G1:

If w is a pole and corresponds to an input (one of the first u poles) or an output (one of the last v poles) in G, then w is an input or output in \(C_{u, v}^g\) as well.

G2:

If w is not a pole and has indegree 1 and outdegree 2, this node has been placed to copy its input to its two outputs. Therefore, when translated to a UC, w is replaced by multiple outgoing wires in the parent node (as described in [45]), since the UC does not need to fulfill the fanout 2 restriction. In \(U_n(\Gamma _{2})\), w is added due to the fanout 2 restriction in the EUG necessary for the edge-embedding.

G3:

If w is not a pole and has indegree and outdegree 1, w is removed and replaced by a wire between its parent and child nodes.

G4:

If w is a pole and corresponds to a gate (poles \(\{u+1, \ldots , u+g\}\)) in G, w is programmed as a universal gate (UG). A 2-input UG supports any of the 16 possible gate types represented by 4 control bits of the gate table \((c_1, c_2, c_3, c_4)\). It implements function U: \(\{0, 1\}^2 \times \{0, 1\}^4 \rightarrow \{0, 1\}\) that computes

$$\begin{aligned} U(x_1, x_2, c_1, c_2, c_3, c_4) = \overline{x_1}~\overline{x_2}c_1 + \overline{x_1}x_2c_2 + x_1 \overline{x_2} c_3 + x_1 x_2 c_4. \end{aligned}$$
(1)
G5:

If w is not a pole and has indegree and outdegree 2, w is programmed as an X-switching block, which computes \(X: \{0, 1\}^2 \times \{0, 1\} \rightarrow \{0, 1\}^2\) with \(X((x_1, x_2), c) = (x_{1+c}, x_{2-c})\) as shown in Fig. 2a. The inputs of an X-switching block are forwarded to its outputs, switched or not switched, depending on control bit c.

G6:

If w is not a pole and has indegree 2 and outdegree 1, w is programmed as a Y-switching block that computes \(Y: \{0, 1\}^2 \times \{0, 1\} \rightarrow \{0, 1\}\) with \(Y((x_1, x_2), c) = x_{1+c}\) as visualized in Fig. 2b. The inputs of a Y-switching block are forwarded to its output depending on the control bit c, i.e., it provides the functionality of a 2-input multiplexer.

Fig. 2. Programmable switching blocks [43]

We note that the u inputs and the v outputs can be ordered arbitrarily within themselves as long as the inputs are kept before the \(g\) topologically ordered gates and the outputs after them. Even though the output nodes cause an overhead in Valiant’s UC, they are required to fully hide the topology of the circuit in the corresponding universal circuit. Note that optionally it is possible to modify the input circuit such that the outputs of the last v gates in order are the outputs of the circuit by inserting at most v copy gates [40].

The nodes programmed as UG (G4), X-switching block (G5), or Y-switching block (G6) are so-called programmable blocks. This means that a control bit c or vector \({\overline{c}}=(c_1, c_2, c_3, c_4)\) is necessary aside from the two inputs to define their behavior. The universal gates are programmed according to the simulated gates in \(C_{u, v}^g\) and the universal switches according to the paths defined by the edge-embedding of the graph of the circuit G into the edge-universal graph \(U_n(\Gamma _{2})\). Depending on whether the path takes the same direction during the embedding (e.g., arrives from the left and continues on the left) or changes its direction at a given node (e.g., arrives from the left and continues on the right), the control bit of the universal switch is programmed accordingly. In Sect. 7.1, we describe efficient implementations of programmable blocks. All control bits and vectors together are the programming \(c^f\) of the UC.
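As an illustration of the three programmable blocks, the following minimal Python sketch evaluates Eq. 1 and the X- and Y-switching functions on single bits. The function names and the `AND_TABLE`/`XOR_TABLE` constants are hypothetical and chosen only for this example; they do not reflect the optimized implementations discussed in Sect. 7.1.

```python
def universal_gate(x1, x2, c):
    """Eq. (1): return the gate-table entry selected by the inputs (x1, x2)."""
    c1, c2, c3, c4 = c
    n1, n2 = 1 - x1, 1 - x2  # negated inputs
    return (n1 & n2 & c1) | (n1 & x2 & c2) | (x1 & n2 & c3) | (x1 & x2 & c4)


def x_switch(x1, x2, c):
    """X-switching block: (x1, x2) for c = 0, swapped for c = 1."""
    return (x2, x1) if c else (x1, x2)


def y_switch(x1, x2, c):
    """Y-switching block (2-input multiplexer): x1 for c = 0, x2 for c = 1."""
    return x2 if c else x1


# Gate tables in the order (c1, c2, c3, c4) = outputs for inputs 00, 01, 10, 11.
AND_TABLE = (0, 0, 0, 1)
XOR_TABLE = (0, 1, 1, 0)
```

Programming a universal gate then amounts to passing the gate table of the simulated gate, e.g., `universal_gate(x1, x2, AND_TABLE)`.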

3.3 Valiant’s 2-way UC Construction

Fig. 3. Body block \(B^{(2)}\) of Valiant’s 2-way EUG \(U_n^{(2)}(\Gamma _1)\) [66]

We described in Sect. 3.1 that a \(U_n(\Gamma _{\rho })\) EUG can be constructed of \(\rho \) instances of \(U_n(\Gamma _{1})\) EUGs. Valiant [66] provides an EUG for \(\Gamma _{1}(n)\) graphs, two of which can build an EUG for \(\Gamma _{2}(n)\) graphs, which suffices for circuits with 2-input gates that have at most two outgoing wires. Let \(P = \{p_1, \ldots , p_n\}\) be the set of poles in \(U_n(\Gamma _{1})\) that have indegree and outdegree 1, corresponding to the inputs, gates and outputs of the input circuit \(C_{u, v}^g\), i.e., poles \(P_\text {in}=\{p_1, \ldots , p_u\}\) correspond to the inputs, \(P_\text {gate}=\{p_{(u + 1)}, \ldots , p_{(u + g)}\}\) to the gates, \(P_\text {out}=\{p_{(u + g+ 1)}, \ldots , p_n\}\) to the outputs. The main, so-called body block \(B^{(2)}\) used for constructing Valiant’s EUG for \(\Gamma _{1}(n)\) graphs \(U_n^{(2)}(\Gamma _{1})\) of size \({\sim }\,2.5n\log _2 n\) is shown in Fig. 3 and consists of 2 poles (large circles), 4 so-called recursion points (rectangles), and 3 additional nodes (small circles). The corresponding UC has twice the size \({\sim }\,5n\log _2 n\), since it corresponds to an EUG for \(\Gamma _{2}(n)\) graphs. This construction is called the 2-way EUG or UC construction since there are two sets of recursion nodes at each recursion step as we describe below.

Fig. 4. Skeleton built of a chain of body blocks \(B^{(2)}\) of Valiant’s 2-way EUG \(U_n^{(2)}(\Gamma _1)\)

The recursive construction works as follows: The rectangles are special nodes that build up the set of poles in the next recursion step, i.e., \(R^1_{\lceil \frac{n}{2}-1\rceil } = \{r^1_1, \ldots , r^1_{\lceil \frac{n}{2}-1\rceil }\}\) and \(R^2_{\lceil \frac{n}{2}-1\rceil } = \{r^2_1, \ldots , r^2_{\lceil \frac{n}{2}-1\rceil }\}\) are the poles of two smaller edge-universal graphs called subgraphs. EUGs are built with these poles, which in turn produce new subgraphs of size \(\lceil \frac{\lceil \frac{n}{2}-1\rceil }{2}-1\rceil \), such that we have four subgraphs at the next level, etc. The blocks are chained together at the recursion points to form a skeleton, i.e., each recursion point is shared by two neighboring blocks and is a pole in the corresponding subgraph. Thus, the main skeleton of the UC consists of \(\lceil \frac{n}{2}\rceil \) such blocks with poles \(\{p_1, p_2, \ldots , p_n\}\), and the next two skeletons consist of \(\lceil \frac{\lceil \frac{n}{2}-1\rceil }{2}\rceil \) blocks with sets of poles \(\{r^1_1, \ldots , r^1_{\lceil \frac{n}{2}-1\rceil }\}\) and \(\{r^2_1, \ldots , r^2_{\lceil \frac{n}{2}-1\rceil }\}\). We visualize the process of chaining the blocks together to form this skeleton in Fig. 4.
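The shrinking pole counts of this recursion can be tabulated with a short Python sketch. The function name `two_way_levels` and the stopping rule (stop once at most `base = 6` poles remain, matching the largest hand-optimized recursion base graph mentioned in Sect. 3.1) are assumptions made for illustration only.

```python
def two_way_levels(n, base=6):
    """Return (number of subgraphs, poles per subgraph) for each recursion level."""
    levels = []
    poles, graphs = n, 1
    while poles > base:
        levels.append((graphs, poles))
        poles = -(-poles // 2) - 1  # ceil(poles / 2 - 1)
        graphs *= 2                 # every skeleton spawns two subgraphs
    levels.append((graphs, poles))  # recursion base reached
    return levels
```

For instance, `two_way_levels(20)` lists one skeleton with 20 poles, two subgraphs with 9 poles each, and four recursion base graphs with 4 poles each.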

We note that the top (resp. bottom) block of a skeleton does not need the upper (resp. lower) recursion points since its poles are the inputs (resp. outputs) in the block. Therefore, we presented optimized so-called head \(H^{(2)}\) and tail \(T^{(2)}\) blocks that occur in the top and bottom of a skeleton, respectively, in [31, Fig. 2b–e].

Theorem 4

([66]) The resulting 2-way EUG is edge-universal, and therefore, the resulting circuit is universal.

Proof of Theorem 4 [Val76]

We recapitulate the proof from [66] that \(U_n^{(2)}(\Gamma _{1})\) is edge-universal for \(\Gamma _{1}(n)\), such that any graph with n nodes and fanin and fanout 1 can be edge-embedded into \(U_n^{(2)}(\Gamma _{1})\). According to the definition of edge-embedding, it has to be shown that given any \(\Gamma _{1}(n)\) graph \(G = (V, E)\), for any \((i, j) \in E\) and \((k, l) \in E\) we can find pairwise edge-disjoint paths from \(p_i\) to \(p_j\) and from \(p_k\) to \(p_l\) in \(U_n^{(2)}(\Gamma _{1})\). As before, the labeling of nodes \(V=\{1, \ldots , n\}\) in G is according to a topological order of the nodes.

Firstly, each two neighboring poles of the EUG, \(p_{2s-1}\) and \(p_{2s}\) for \(s \in \{1, \ldots , \lceil \frac{n}{2}\rceil \}\), are thought of as merged poles, so-called superpoles, with their fanin and fanout becoming 2. In a similar manner, any graph \(G\in \Gamma _{1}(n)\) can be regarded as a \(\Gamma _{2}(\lceil \frac{n}{2} \rceil )\) graph with supernodes, i.e., each pair \((2s-1, 2s)\) is merged into one node of a \(\Gamma _{2}(\lceil \frac{n}{2} \rceil )\) graph \(G'=(V', E')\). If there is an edge between the two nodes of a pair in G, it is simulated with a loop. The set of edges of this graph \(G'\) is partitioned into disjoint sets \(E_1\) and \(E_2\), such that \(G_1=(V, E_1)\) and \(G_2=(V, E_2)\) are instances of \(\Gamma _{1}(\lceil \frac{n}{2} \rceil )\) and \(\Gamma _{1}(\lfloor \frac{n}{2} \rfloor )\), respectively. This can be done efficiently, as shown in Theorem 1. The edges in \(E_1\) are embedded as directed paths in \(R^1_{\lceil \frac{n}{2}-1 \rceil }\), and the edges in \(E_2\) as directed paths in \(R^2_{\lceil \frac{n}{2}-1 \rceil }\). Both \(E_1\) and \(E_2\) have at most one edge directed into and at most one directed out of any supernode, and therefore, there is only one edge from \(E_1\) and one from \(E_2\) to be simulated going through any superpole in \(U_n^{(2)}(\Gamma _{1})\) as well. Thus, the edge coming into a superpole \((p_{2s-1}, p_{2s})\) in \(E_1\) is embedded as a path through \(r^1_{s-1}\), while the edge going out of the superpole in \(E_1\) is embedded as a path through \(r^1_s\) in the appropriate subgraph. Similarly, the edges in \(E_2\) are simulated as paths through \(r^2_{s-1}\) and \(r^2_s\). These paths can be chosen disjoint according to the induction hypothesis. Finally, the paths from \(r^1_{s-1}\) and \(r^2_{s-1}\) to superpole \((p_{2s-1}, p_{2s})\) as well as the paths from \((p_{2s-1}, p_{2s})\) to \(r^1_s\) and \(r^2_s\) can be chosen edge-disjoint due to the skeleton built up of the body blocks shown in Fig. 3.
With this, Valiant’s graph construction results in a valid EUG with asymptotically optimal size \({\mathcal {O}}(n\log n)\) and depth \({\mathcal {O}}(n)\) [66]. With the building blocks described in Sect. 3.2, it is easy to see that the resulting Boolean circuit is universal. \(\square \)
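The partition of the edges into \(E_1\) and \(E_2\) used in the proof can be sketched in Python as follows. This is a minimal illustration under the assumption that the graph is given as a list of directed edges with fanin and fanout at most 2; the name `two_color` and the bipartite-walk strategy are one standard way to realize Kőnig's theorem for maximum degree 2 and are not taken from the implementation of [45].

```python
from collections import defaultdict


def two_color(edges):
    """Split edges of a fanin/fanout-2 graph into two fanin/fanout-1 edge sets."""
    # Bipartite view: edge (i, j) links the "out" side of i to the "in" side of j.
    incident = defaultdict(list)
    for idx, (i, j) in enumerate(edges):
        incident[("out", i)].append(idx)
        incident[("in", j)].append(idx)
    if any(len(inc) > 2 for inc in incident.values()):
        raise ValueError("fanin or fanout larger than 2")

    color = {}

    def walk(side, idx, c):
        # Follow a path or cycle and alternate the color at every shared endpoint.
        while idx is not None and idx not in color:
            color[idx] = c
            i, j = edges[idx]
            side = ("in", j) if side == ("out", i) else ("out", i)
            others = [e for e in incident[side] if e != idx]
            idx = others[0] if others else None
            c = 1 - c

    # Paths must be started at an endpoint; the remaining components are cycles.
    for side, inc in list(incident.items()):
        if len(inc) == 1:
            walk(side, inc[0], 0)
    for idx, (i, _) in enumerate(edges):
        walk(("out", i), idx, 0)

    e1 = [e for idx, e in enumerate(edges) if color[idx] == 0]
    e2 = [e for idx, e in enumerate(edges) if color[idx] == 1]
    return e1, e2
```

Because every component of the bipartite view is a path or an even cycle, alternating the two colors along each component guarantees that no node has two incoming or two outgoing edges of the same color.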

Implementation. We provided an open-source implementation of this 2-way UC optimized for PFE in [45]. In concurrent and independent related work, Lipmaa et al. [46] also showed the practicality of Valiant’s 2-way UC. They decrease its total number of gates compared to that of Valiant’s block (Fig. 3) by one XOR gate. However, the number of AND gates is exactly the same, and therefore, their improvement does not affect PFE using UCs, where XOR gates are evaluated for free [44].

Fig. 5. Body block \(B^{(4)}\) alternatives for 4-way EUG \(U_n^{(4)}(\Gamma _1)\)

3.4 Valiant’s 4-way UC Construction

Similarly to the 2-way EUG construction (cf. Sect. 3.3), Valiant provides a more efficient 4-way EUG or UC construction [66] for \(\Gamma _{1}(n)\) graphs which can be extended to an EUG for \(\Gamma _{2}(n)\) graphs by utilizing two instances \(U_n^{(4)}(\Gamma _{1})_1\) and \(U_n^{(4)}(\Gamma _{1})_2\) as described in Sect. 3.1. \(U_n^{(4)}(\Gamma _{1})\) has a 4-way recursive structure, i.e., at each recursion step, nodes in special sets \(R^1_{\lceil \frac{n}{4}-1\rceil } = \{r^1_1, \ldots , r^1_{\lceil \frac{n}{4}-1\rceil }\}\), \(R^2_{\lceil \frac{n}{4}-1\rceil } = \{r^2_1, \ldots , r^2_{\lceil \frac{n}{4}-1\rceil }\}\), \(R^3_{\lceil \frac{n}{4}-1\rceil } = \{r^3_1, \ldots , r^3_{\lceil \frac{n}{4}-1\rceil }\}\) and \(R^4_{\lceil \frac{n}{4}-1\rceil } = \{r^4_1, \ldots , r^4_{\lceil \frac{n}{4}-1\rceil }\}\) are the poles in the next recursion step (the main body block is shown in Fig. 5a). The recursion base is the same as for the 2-way UC construction described in Sect. 3.1. This construction results in UCs of smaller size \({\sim }\,4.75 n\log _2n\) but has a more complicated structure and programming algorithm. We have studied and implemented this universal circuit in [31] and recapitulate our results here and in Sect. 7. Valiant offers the main, so-called body block \(B^{(4)}\) consisting of 4 poles (large circles), 15 nodes (small circles) as well as 8 recursion points (rectangles) shown in Fig. 5a. As before, we provide so-called head \(H^{(4)}\) and tail \(T^{(4)}\) blocks that occur at the top and bottom of a skeleton, respectively, in [31, Figs. 4b–4i]. The blocks are connected such that the 4 top (resp. bottom) recursion points of one block are the 4 bottom (resp. top) recursion points of the next block.
Similarly to the 2-way EUG, 4 sets are created for n nodes, i.e., \(R^1_{\lceil \frac{n}{4}-1\rceil }\), \(R^2_{\lceil \frac{n}{4}-1\rceil }\), \(R^3_{\lceil \frac{n}{4}-1\rceil }\), and \(R^4_{\lceil \frac{n}{4}-1\rceil }\), which are the poles of 4 \(U_{\lceil \frac{n}{4}-1\rceil }(\Gamma _1)\) EUGs in the next recursion step. Each of these in turn creates 4 subgraphs, and so on until the recursion base is reached (cf. Sect. 3.1).

Recently, Zhao et al. [72] optimized the body block of Valiant’s 4-way UC by finding a more efficient block using exhaustive search over all possible blocks. As opposed to Valiant’s body block, which uses 15 additional nodes, their block uses only 14 additional nodes, and therefore, their UC achieves a smaller asymptotic size of \({\sim }\,4.5 n\log _2 n\). We depict the optimized body block \(B^{(4)}\) of Zhao et al. in Fig. 5b. Zhao et al. provide a computer-generated proof that this block can indeed be used to construct universal circuits. Moreover, they show that there exists no block with only 13 additional nodes that can be used to construct UCs in the same manner. This proves that \({\sim }\,4.5n\log _2n\) is the minimal size of a 4-way UC constructed in this manner.
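To give a feeling for the difference between the three leading terms quoted so far, the following Python sketch evaluates \(c \cdot n\log _2 n\) for the constants 5, 4.75, and 4.5. The helper `leading_term` is hypothetical and deliberately ignores all lower-order terms, so the values are not the concrete UC sizes.

```python
import math


def leading_term(n, c):
    """Leading term c * n * log2(n) of an asymptotic UC size estimate."""
    return c * n * math.log2(n)


# Constants quoted in the text for the three constructions.
CONSTANTS = {"2-way [66]": 5.0, "4-way [66]": 4.75, "4-way [72]": 4.5}

for name, c in CONSTANTS.items():
    print(f"{name}: {leading_term(1024, c):.0f}")
```

For \(n = 1024\), the leading terms differ by \(0.25 \cdot 1024 \cdot 10 = 2560\) between consecutive constants.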

Theorem 5

([66]) The resulting 4-way EUG is edge-universal, and therefore, the resulting circuit is universal.

The proof of this theorem is analogous to that of Theorem 4.

4 Programming Valiant’s Universal Circuits

We designed the detailed embedding algorithm and the open-source UC implementation of [45] specifically for the 2-way UC, dealing with the whole UC skeleton as one block. In contrast, based on the modular design of [46], we modularized the edge-embedding task into multiple subtasks and described how they can be performed separately in [31]. In this section, we detail this modular approach for edge-embedding a graph into Valiant’s \(\ell \)-way EUG, where \(\ell =2\) or \(\ell =4\): The edge-embedding can be split into two parts, which are then combined.

In the following, we describe the two main steps of our modular approach presented in [31], which are based on the edge-embedding algorithm of [45]. 1) Block edge-embedding (Sect. 4.1) programs the blocks visualized in Fig. 3 and in Fig. 5a or b. 2) Recursion point edge-embedding (Sect. 4.2) takes care of the programming of the whole UC: here, the paths are defined and the necessary information is provided to the blocks. The process can be generalized to any \(2^i\)-way EUG. Moreover, the same modular edge-embedding algorithm can be applied with a few modifications to Lipmaa et al.’s generalization to any k-way UC [46], which we describe later in Sect. 5.1.

4.1 Block Edge-Embedding

We consider the \(\ell \) top (resp. bottom) recursion points of a block (Figs. 3 and 5a or b) as intermediate nodes where the inputs (resp. outputs) of the block enter (resp. exit). The blocks are built so that any of these inputs can be forwarded to exactly one of the \(\ell \) poles of the block and the output of any pole can be forwarded to an output or another pole with a higher topological order.

We formalize this behavior as follows: In \(U_n^{(\ell )}(\Gamma _{1})=(V_U, E_U)\), let \(B^{(\ell )}\) be the \((i + 1)\)th block in the skeleton made up of blocks visualized in Fig. 3 for \(\ell =2\) and Fig. 5a or b for \(\ell =4\), with poles \(p_{\ell i+1}, \ldots , p_{\ell i+\ell }\). Let the mapping \(\eta ^U: V_U\rightarrow {\mathbb {N}}^+\) denote a topological order of all nodes and poles in \(V_U\). Then, the nodes \(r_i^1, \ldots , r_i^\ell \) and \(r_{i+1}^1, \ldots , r_{i+1}^\ell \) denote the input and output recursion points of block \(B^{(\ell )}\). Additionally, let \({in} = (in_1, \ldots , in_\ell ) \in \{0, \ldots , \ell \}^\ell \) and \({out}=(out_1, \ldots , out_\ell ) \in \{0, \ldots , 2\ell -1\}^\ell \) denote the input and output vectors of \(B^{(\ell )}\). The value 0 of the input and output vectors is a dummy value which is used if there is no specific path between an input and a pole, or between a pole and an output of \(B^{(\ell )}\). The output vector has a larger value range, since a pole can be forwarded to another pole or an output recursion point. Therefore, we use values \(1, \ldots , \ell -1\) for poles \(p_{\ell i + 2}, \ldots , p_{\ell i + \ell }\) and values \(\ell , \ldots , 2\ell -1\) for the output recursion points. Pole \(p_{\ell i + 1}\) cannot be a destination for a path in \(B^{(\ell )}\), since \(\eta ^U(p_{\ell i + 1})\) is less than the topological order of any other pole in \(B^{(\ell )}\). Additionally, the values of in and out need to be pairwise different or 0. Every combination of input and output vector satisfying the conditions formalized below in Eqs. 2–6 is valid for \(B^{(\ell )}\). A pair \((r_i^l, p_j) \in {\mathcal {P}}\) or \((p_j, r_{i+1}^l) \in {\mathcal {P}}\) is a path from \(r_i^l\) to \(p_j\) or from \(p_j\) to \(r_{i+1}^l\) in the set of all paths \({\mathcal {P}}\) in \(B^{(\ell )}\). Then, \({\mathcal {P}}_B^{(\ell )} \subseteq {\mathcal {P}}\) denotes the paths that are to be edge-embedded (cf. Sect. 3.1).

$$\begin{aligned} \textsc {InPolePath:}&\forall l \in \{1, \ldots , \ell \}: in_l \ne 0 \rightarrow&(r_i^l, p_{\ell i + in_l}) \in {\mathcal {P}}_B^{(\ell )},\end{aligned}$$
(2)
$$\begin{aligned} \textsc {PolePolePath:}&\forall l \in \{1, \ldots , \ell \}: out_l \ne 0 \wedge out_l< \ell \rightarrow&(p_{\ell i + l}, p_{\ell i + 1 + out_l}) \in {\mathcal {P}}_B^{(\ell )} \wedge \eta ^U(p_{\ell i + l}) < \eta ^U(p_{\ell i + 1 + out_l}), \end{aligned}$$
(3)
$$\begin{aligned} \textsc {PoleOutPath:}&\forall l \in \{1, \ldots , \ell \}: out_l > \ell -1 \rightarrow&(p_{\ell i + l}, r_{i+1}^{out_l - \ell + 1}) \in {\mathcal {P}}_B^{(\ell )}.\end{aligned}$$
(4)
$$\begin{aligned} \textsc {InDiff:}&\forall in_i, in_j \in in: i \ne j \rightarrow&in_i = 0 \vee in_i \ne in_j. \end{aligned}$$
(5)
$$\begin{aligned} \textsc {OutDiff:}&\forall out_i, out_j \in out: i \ne j \rightarrow&out_i = 0 \vee out_i \ne out_j. \end{aligned}$$
(6)
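The conditions on the input and output vectors can be checked mechanically. The following Python sketch is a hedged illustration: the names `is_valid_block_programming` and `decode_out` are hypothetical, the forward-only rule for pole-to-pole paths assumes that the poles of a block are topologically ordered by their index, and the decoding of the values \(\ell , \ldots , 2\ell -1\) to the recursion points \(1, \ldots , \ell \) follows the value ranges stated in the text above.

```python
def is_valid_block_programming(in_vec, out_vec, ell):
    """Check value ranges, InDiff/OutDiff and the forward-only pole-to-pole rule."""
    if len(in_vec) != ell or len(out_vec) != ell:
        return False
    if any(not 0 <= v <= ell for v in in_vec):
        return False
    if any(not 0 <= v <= 2 * ell - 1 for v in out_vec):
        return False
    for vec in (in_vec, out_vec):  # InDiff and OutDiff: nonzero entries are distinct
        used = [v for v in vec if v != 0]
        if len(used) != len(set(used)):
            return False
    # PolePolePath: local pole l may only feed a later local pole 1 + out_l.
    return all(not 0 < v < ell or l < 1 + v
               for l, v in enumerate(out_vec, start=1))


def decode_out(value, ell):
    """Translate one entry of the output vector into its target inside the block."""
    if value == 0:
        return None                                # dummy value: no path
    if value < ell:
        return ("pole", 1 + value)                 # local pole 2, ..., ell
    return ("recursion_point", value - ell + 1)    # output recursion point 1, ..., ell
```

The check is not exhaustive (for example, it does not verify that a pole receives at most one incoming path), but it covers the conditions named in Eqs. 2–6.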

4.2 Recursion Point Edge-Embedding

Block edge-embedding covers only the programming of the nodes within the blocks of the UC. Another task is to program the recursion points. We use the construction of [45] which, in every step, splits a \(\Gamma _{2}(n)\) graph in two \(\Gamma _{1}(n)\) graphs, which are merged to two \(\Gamma _{2}(\lceil \frac{n}{2}-1\rceil )\) graphs. This, as described later, results in a tree of graphs with fanin and fanout one or two called supergraph [45]. We use this supergraph for defining the paths in Valiant’s 2-way EUG. For Valiant’s 4-way EUG, we use every second step of the algorithm with a minor modification. We describe our modular algorithm for the 2-way and 4-way UCs below and in Listing 1.

Let \(C_{u, v}^g\) be the Boolean circuit computing function f that our UC needs to compute and \(G \in \Gamma _{2}(n)\) its graph representation (cf. Sect. 2.2).

  1. Splitting \(G \in \Gamma _{2}(n)\) into two \(\Gamma _{1}(n)\) graphs \(G_1\) and \(G_2\): As described in Sect. 3.1, Valiant’s UC is derived from an EUG for \(\Gamma _{2}(n)\) graphs, which is built up of two EUGs \((U_n^{(\ell )}(\Gamma _{1}))_1\) and \((U_n^{(\ell )}(\Gamma _{1}))_2\) for \(\Gamma _{1}(n)\) graphs merged by their poles. G is similarly split into two \(\Gamma _{1}(n)\) graphs \(G_1\) and \(G_2\), which then need to be edge-embedded into \((U_n^{(\ell )}(\Gamma _{1}))_1\) and \((U_n^{(\ell )}(\Gamma _{1}))_2\), respectively. \(G=(V, E)\in \Gamma _{2}(n)\) is split by 2-coloring its edges [45, 66], which can always be done due to Kőnig’s theorem [38, 48] recapitulated in Theorems 1 and 2. After 2-coloring, E is divided into sets \(E_1\) and \(E_2\), from which we build \(G_1 = (V, E_1)\) and \(G_2 = (V, E_2)\), with the following conditions:

    $$\begin{aligned} \textsc {EdgeIn}E_1\textsc {or}E_2:&\quad \forall e \in E: (e \in E_1 \vee e \in E_2) \wedge \lnot (e \in E_1 \wedge e \in E_2). \end{aligned}$$
    (7)
    $$\begin{aligned} \textsc {Fanin1}E_1:&\quad \forall e = (v_1, v_2) \in E_1: \lnot \exists e' = (v_3, v_4) \in E_1, e' \ne e: v_2 = v_4 \vee v_1 = v_3. \end{aligned}$$
    (8)
    $$\begin{aligned} \textsc {Fanin1}E_2:&\quad \forall e = (v_1, v_2) \in E_2: \lnot \exists e' = (v_3, v_4) \in E_2, e' \ne e: v_2 = v_4 \vee v_1 = v_3. \end{aligned}$$
    (9)
  2. Merging a \(\Gamma _{1}(n)\) graph into a \(\Gamma _{2}(\lceil \frac{n}{2}-1\rceil )\) graph. In an EUG, the number of poles decreases in each recursion step, and merging a \(\Gamma _{1}(n)\) graph into a \(\Gamma _{2}(\lceil \frac{n}{2}-1\rceil )\) graph provides information about the paths to be taken. Let \(G_1 = (V, E) \in \Gamma _{1}(n)\) be a topologically ordered graph and \(G_m = (V^\prime , E^ \prime )\) be the graph to be constructed, initially with nodes \(V'=\{v^\prime _1, \ldots , v^\prime _{\lceil \frac{n}{2}\rceil }\}\). We define two labelings \(\eta _{\text {in}}\) and \(\eta _{\text {out}}\) on \(G_m\) with \(\eta _{\text {in}}(v^\prime _i) = i\) and \(\eta _{\text {out}}(v^\prime _i) = \eta _{\text {in}}(v^\prime _i) - 1 = i - 1\). Additionally, we define a mapping \(\theta _V\) that maps a node \(v_i \in V\) to a node in \(V^\prime \) with \(\theta _V(v_i) = v^\prime _{\lceil \frac{i}{2}\rceil }\), i.e., two nodes in \(G_1\) are mapped to one node in \(G_m\). Finally, we define a mapping \(\theta _E\) that maps an edge \((v_i, v_j) \in E\) to an edge in \(E^\prime \) with \(\theta _E((v_i, v_j)) = (v^\prime _{\eta _{\text {in}}(\theta _V(v_i))}, v^\prime _{\eta _{\text {out}}(\theta _V(v_j))})\), i.e., every edge in \(G_1\) is mapped to an edge in \(G_m\) as follows: \(e = (v_i, v_j) \in E\) is mapped to \(e^\prime = (v_k^\prime , v_l^\prime ) \in E^\prime \), such that \(v_k^\prime = \theta _V(v_i)\), and the new node of \(v_j\) in \(G_m\) is \(v_{l+1}^\prime \) (not \(v_l^\prime \)). \(G_m\) is built as follows: \(V' = \{v_1^\prime , \ldots , v_{\lceil \frac{n}{2}\rceil }^\prime \}\) and \(E^\prime =\bigcup _{e \in E}\{\theta _E(e)\}\). Then every edge \(e = (v_i^\prime , v_j^\prime ) \in E^\prime \) with \(j < i\) is removed from \(E^\prime \), along with the last node \(v^\prime _{\lceil \frac{n}{2}\rceil }\) (due to the definition of \(\theta _E\), it does not have any incoming edges). The resulting \(G_m\) is a topologically ordered graph in \(\Gamma _{2}(\lceil \frac{n}{2}-1\rceil )\).

  3. The supergraph for Valiant’s EUG construction. In the first step, G is split into two \(\Gamma _{1}(n)\) graphs \(G_1\) and \(G_2\). \(G_1\) and \(G_2\) contain all the edges that should be embedded as paths between poles in the first and second EUGs for \(\Gamma _{1}(n)\), respectively. We now explain how to edge-embed the \(\Gamma _{1}(n)\) graph \(G_1\) into an EUG \(U_n^{(\ell )}(\Gamma _{1})\) (for \(G_2\) it is analogous).

For edge-embedding in the 2-way EUG, \(G_1\) is first merged to a \(\Gamma _{2}(\lceil \frac{n}{2}-1\rceil )\) graph \(G_{m}\). \(G_{m}\) is then 2-colored and split into two \(\Gamma _{1}(\lceil \frac{n}{2}-1\rceil )\) graphs \(G_1^{\text {1}}\) and \(G_1^{\text {2}}\) [45]. These get merged to two graphs \(G_{m}^{\text {1}}\) and \(G_{m}^{\text {2}}\), which are then 2-colored and split into two \(\Gamma _{1}(\lceil \frac{\lceil \frac{n}{2}-1\rceil }{2}-1\rceil )\) graphs. These steps are repeated until the recursion base is reached. In the supergraph, \(G_1^{\psi \circ \text {1}}\) and  \(G_1^{\psi \circ \text {2}}\) are the first and second subgraphs of \(G_1^{\psi }\) for any \(\psi \), respectively.
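The merging step of the 2-way supergraph can be illustrated with a small Python sketch. It is a simplified reading of the mappings \(\theta _V\) and \(\theta _E\): nodes are represented by their 1-based indices, the function name `merge` is hypothetical, and the code is not taken from the implementation of [45].

```python
def ceil_half(i):
    """theta_V on indices: node i of G_1 belongs to merged node ceil(i / 2)."""
    return -(-i // 2)


def merge(n, edges):
    """Merge a Gamma_1(n) graph into a Gamma_2(ceil(n/2) - 1) graph.

    edges: list of (i, j) with 1 <= i < j <= n (topological order).
    Returns the number of merged nodes and the list of merged edges.
    """
    merged = []
    for i, j in edges:
        src = ceil_half(i)        # eta_in label of theta_V(v_i)
        dst = ceil_half(j) - 1    # eta_out label of theta_V(v_j)
        if dst >= src:            # edges with dst < src are removed
            merged.append((src, dst))
    return ceil_half(n) - 1, merged
```

In this reading, an edge between two nodes of the same pair yields `dst < src` and is dropped, whereas an edge between two neighboring pairs becomes a loop `(b, b)` in the merged graph.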

In Valiant’s 4-way EUG construction [66], a supergraph that creates 4 subgraphs in each step is necessary. We require a merging method where a \(\Gamma _{1}(n)\) graph is merged to a \(\Gamma _4(\lceil \frac{n}{4}-1\rceil )\) graph where 4 nodes build a new node, and 4-color this graph to retrieve 4 subgraphs. However, this can directly be solved by using the method described above from [45]: After repeating the 2-coloring and the merging twice, we gain 4 subgraphs (\(G_1^{\text {11}}\), \(G_1^{\text {12}}\), \(G_1^{\text {21}}\), and \(G_1^{\text {22}}\)). These can be used as if they were the result of 4-coloring the graph obtained by merging every 4 nodes into one.

However, there is a modification in this case: The first 2-coloring is a preprocessing step, which does not map to an EUG recursion step. Therefore, we have to define another labeling \(\eta _{\text {out}_P}(v) = \eta _{\text {in}}(v)\), since in this preprocessing step we need to keep node \(v_{\lceil \frac{n}{2}\rceil }\). Then the creation of the supergraph for the 4-way EUG construction works as follows: We merge \(G_1\) to a \(\Gamma _{2}(\lceil \frac{n}{2}\rceil )\) graph with labeling \(\eta _{\text {in}}\) and \(\eta _{\text {out}_P}\) and get \(G_{m}\). After that, we split \(G_{m}\) into two \(\Gamma _{1}(\lceil \frac{n}{2}\rceil )\) graphs \(G_1^{\text {1}}\) and \(G_1^{\text {2}}\). These get merged to \(\Gamma _{2}(\lceil \frac{n}{4}\rceil -1)\) graphs \(G_{m}^{\text {1}}\) and \(G_{m}^{\text {2}}\) using the \(\eta _{\text {in}}\) and \(\eta _{\text {out}}\) labelings. Finally, these two graphs get split into 4 \(\Gamma _{1}(\lceil \frac{n}{4}-1\rceil )\) graphs \(G_1^{\text {11}}\), \(G_1^{\text {12}}\), \(G_1^{\text {21}}\), and \(G_1^{\text {22}}\). These are the relevant graphs for the first recursion step in Valiant’s 4-way EUG construction. Then we continue for all 4 subgraphs until we reach the recursion base.
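As a sanity check of the node counts in this two-step procedure, note that the preprocessing merge keeps \(\lceil \frac{n}{2}\rceil \) nodes and the second merge removes the last node. Using the standard identity for nested ceilings, this matches the size of the recursion point sets of the 4-way EUG; the concrete value \(n = 10\) below is chosen only for illustration:

$$\begin{aligned} \left\lceil \frac{\lceil \frac{n}{2}\rceil }{2}\right\rceil - 1 = \left\lceil \frac{n}{4}\right\rceil - 1 = \left\lceil \frac{n}{4}-1\right\rceil , \qquad \text {e.g., } n = 10: \left\lceil \tfrac{10}{2}\right\rceil = 5, \quad \left\lceil \tfrac{5}{2}\right\rceil - 1 = 2 = \left\lceil \tfrac{10}{4}\right\rceil - 1. \end{aligned}$$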

\(\ell \)-way Edge-Embedding Algorithm. In Listing 1, we combine block edge-embedding and recursion point edge-embedding.

Listing 1. \(\ell \)-way edge-embedding algorithm

Let \({\mathcal {U}}\) denote the part of \(U_n^{(\ell )}(\Gamma _{1})\) without recursion steps (the main skeleton) and \(G_1 = (V, E)\) be the \(\Gamma _{1}(n)\) graph which is to be edge-embedded in \(U_n^{(\ell )}(\Gamma _{1})\). \({\mathcal {S}}\) denotes the set of \(\ell \) subgraphs of \(G_1\) in the supergraph, i.e., \({\mathcal {S}}=\{G_1^{\text {1}}, G_1^{\text {2}}\}\) for \(\ell =2\) and \({\mathcal {S}}=\{G_1^{\text {11}}, G_1^{\text {12}}, G_1^{\text {21}}, G_1^{\text {22}}\}\) for \(\ell =4\). A recursion step graph of \({\mathcal {U}}\) is one of the graphs having one of the \(\ell \) sets of recursion points as poles (e.g., \(r^1_1, \ldots , r^1_{\lceil \frac{n}{\ell }-1\rceil }\)) without the recursion steps. \({\mathcal {R}}\) denotes the set of all \(\ell \) recursion step graphs of \({\mathcal {U}}\), and \({\mathcal {B}}\) denotes the set of all blocks in \({\mathcal {U}}\).

We give a brief explanation of Listing 1 that describes the edge-embedding process. For any edge \(e = (v_i, v_j) \in E\) in \(G_1\), \(b_i\) and \(b_j\) denote the block numbers in which \(v_i\) and \(v_j\) are. We distinguish between two cases:

Case 1. \(v_i\) and \(v_j\) are in the same block:  \(b_i = b_j\). The edge-embedding is solved within the block, and no recursion points have to be programmed for the path. Therefore, vector out of block \({\mathcal {B}}[b_i]\) is set accordingly.

Case 2. \(v_i\) and \(v_j\) are in different blocks: \(b_i \ne b_j\). There exists an edge \(e^\prime = (b_i, b_j - 1)\) in one of the \(\ell \) \(\Gamma _{1}(\lceil \frac{n}{\ell }-1\rceil )\) subgraphs of \(G_1\) that is not yet used for an edge-embedding. This determines that the path in the next recursion step has to be between poles \(p_{b_i}\) and \(p_{b_j - 1}\). We denote by \(s \in {\mathcal {S}}\) the subgraph of \(G_1\) which contains \(e^\prime \), and x denotes its number in \({\mathcal {S}}\), i.e., \({\mathcal {S}}[x] = s\). This determines in which of the \(\ell \) recursion step graphs we need to edge-embed the path from \(p_{b_i}\) to \(p_{b_j - 1}\), and thus which recursion points we need to program. We first set the control bit of the xth input (resp. output) recursion point to 1, since the path between the poles with labeling i and j enters (resp. exits) the next recursion step over this recursion point. A special case to be considered here is when blocks \({\mathcal {B}}[b_i]\) and \({\mathcal {B}}[b_j]\) are neighbors (i.e., \(b_j = b_i + 1\)). Then, the path enters and leaves the next recursion step graph at the same node, whose control bit thus has to be 0. In the output vector of block \({\mathcal {B}}[b_i]\), the \(i^\prime \)th value is set to the xth recursion point, and in the input vector of block \({\mathcal {B}}[b_j]\), the xth value is set to the \(j^\prime \)th pole of this block, where \(i^\prime \) and \(j^\prime \) denote the positions of \(v_i\) and \(v_j\) within their blocks.

We repeat these steps for all edges \(e \in E\). Once all input and output vectors of all blocks in \({\mathcal {B}}\) are set, the blocks can be programmed with the block edge-embedding. For all \(\ell \) subgraphs of \(G_1\) in the supergraph and in the EUG, we call the same procedure with \({\mathcal {S}}[i] \in {\mathcal {S}}\), \({\mathcal {R}}[i] \in {\mathcal {R}}\), \(1 \le i \le \ell \).
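The case distinction above can be sketched in Python as follows. This is only an illustration of the first decision for each edge, not a reproduction of Listing 1: the function name `classify_edges`, the 1-based block numbering, and the returned tuples are assumptions, and the choice of the subgraph \({\mathcal {S}}[x]\) and the programming of the control bits are omitted.

```python
def classify_edges(edges, ell):
    """Split the edges of a Gamma_1(n) graph into Case 1 and Case 2.

    edges: list of (i, j) with 1-based node indices and i < j.
    Returns (same_block, next_level):
      same_block: (block, local source pole, local target pole) triples,
      next_level: (b_i, b_j - 1) pairs to embed in a recursion step graph.
    """
    def block(i):
        return -(-i // ell)          # ceil(i / ell)

    def local(i):
        return (i - 1) % ell + 1     # position of node i inside its block

    same_block, next_level = [], []
    for i, j in edges:
        b_i, b_j = block(i), block(j)
        if b_i == b_j:
            same_block.append((b_i, local(i), local(j)))   # Case 1
        else:
            next_level.append((b_i, b_j - 1))               # Case 2
    return same_block, next_level
```

A pair `(b, b)` in `next_level` corresponds to the special case of neighboring blocks, where the path enters and leaves the recursion step graph at the same node.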

5 Extensions to Valiant’s UC Constructions

Here, we describe ideas for novel UC constructions and implementations. Firstly, in Sect. 5.1, we describe the k-way generalization of Valiant’s UC presented by Lipmaa et al. in [46]. In Sect. 5.2, we describe our modular building blocks for a potentially more efficient 3-way UC. We show that Valiant’s optimized \(U_3(\Gamma _1)\) cannot directly be applied as a building block in the construction because it needs an additional node to be part of a generic EUG. We prove that the EUG without this node is not a valid EUG by showing a counterexample. Therefore, the 3-way UC actually results in a worse asymptotic size than Valiant’s 2-way and 4-way UCs [66]. Thereafter, in Sect. 5.3, we propose a hybrid UC, utilizing both Valiant’s 2-way and 4-way UCs or Valiant’s 2-way and Zhao et al.’s 4-way UC [72], so that the overall size of the resulting hybrid UC is minimized and is at least as efficient as the better construction for the given size (in Sect. 6.2 we show its concrete improvement). Finally, in Sect. 5.4, we propose a different modular and scalable approach to Valiant’s 4-way UC. This approach requires many modifications in the UC generation and programming algorithm, but can be generalized to any k-way UC or to our hybrid UC.

5.1 Generalized k-way UC

In [46], Lipmaa et al. generalize Valiant’s approach by providing a UC with any number of recursion points k, the so-called k-way EUG or UC. We note that their construction slightly differs from Valiant’s EUG, since they do not consider the restriction on the fanout of the poles, i.e., the nodes in the EUG that correspond to universal gates or inputs (cf. Sect. 3.1). This optimization has also been included in [45] when translating an EUG to a UC, but including it in the block design leads to better sizes for the number of XOR gates. This, however, does not make a difference in case of our most prominent application of private function evaluation (PFE) (cf. Sect. 1.1), where XOR gates are free, i.e., do not require cryptographic operations and communication.

The idea is to split \(n = u + v + g\) into \(m = \lceil \frac{n}{k}\rceil \) blocks as shown in Fig. 6. Every block i consists of k inputs \(r_{i}^1, r_{i}^2, \ldots , r_{i}^k\) and k outputs \(r_{i+1}^1, r_{i+1}^2, \ldots , r_{i+1}^k\) as well as k poles, except for the last block, which has a number of poles depending on n mod k. For every \(j \le k\), the list of all \(r_i^j\) builds the poles of the jth subgraph of the next recursion step, i.e., we have k subgraphs. Additionally, every block begins and ends with a Waksman permutation network [67] such that the inputs and outputs can be permuted to any pole. A Y-switching block is placed in front of every pole \(p_i\), which is connected to the ith output of the permutation network as well as the ith output of a block-internal EUG \(U_k(\Gamma _1)\). This means that Lipmaa et al. in [46] reduce the problem of finding an efficient k-way EUG \(U_n^{(k)}(\Gamma _2)\) block \(B^{(k)}\) to the problem of finding the smallest EUG \(U_k(\Gamma _1)\). Their solution is to build the block-internal EUG with the UC of [44], which was claimed to be more efficient for smaller circuits than [66]. Moreover, they calculate the optimal k value to be around 3.147 with their construction, which implies that the best solutions are found using small EUGs, for which Valiant provides hand-optimized solutions (i.e., for \(k=2, 3, 4, 5, 6\)) [66].
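The partition of the n poles into \(m = \lceil \frac{n}{k}\rceil \) blocks can be made concrete with a few lines of Python. The helper `k_way_blocks` is hypothetical and only illustrates the index arithmetic; it does not model the permutation networks or the block-internal EUG.

```python
def k_way_blocks(n, k):
    """Return the 1-based pole indices of the m = ceil(n / k) blocks."""
    m = -(-n // k)
    return [list(range(b * k + 1, min((b + 1) * k, n) + 1)) for b in range(m)]


blocks = k_way_blocks(10, 3)
# Neighboring blocks share their recursion points, so each of the k subgraphs
# of the next recursion step has len(blocks) - 1 poles.
subgraph_poles = len(blocks) - 1
```

For \(n = 10\) and \(k = 3\), this yields three full blocks and a last block with a single pole, and each of the 3 subgraphs has \(\lceil \frac{10}{3}-1\rceil = 3\) poles.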

We note that the results recently presented by Zhao et al. [72] do not fit into this generalized k-way construction. Instead, Zhao et al.’s optimized 4-way block is an optimization of Valiant’s modular 4-way block construction [66].

Fig. 6 k-way EUG construction \(U_n^{(k)}(\Gamma _1)\) [46]

5.1.1 Programming the Generalized UC

In this section, we extend the recent work of [46] by providing a detailed and modular embedding mechanism for any k-way EUG construction. We provide the main differences to the edge-embedding of the 2-way and 4-way EUG detailed in Sect. 4.

k-way Block Edge-Embedding. In this setting, our main block is a programmable block \(B^{(k)}\) with k poles \(p_1, \ldots , p_k\), and k input [output] recursion points \(r_0^1, \ldots , r_0^k\) [\(r_1^1, \ldots , r_1^k\)]. \(B^{(k)}\) is topologically ordered with mapping \(\eta ^U\) as defined in Sect. 2.1. Vectors \(in =(in_1, \ldots , in_k) \in \{0, \ldots , k\}^k\) and \(out = (out_1, \ldots , out_k) \in \{0, \ldots , 2k - 1\}^k\) denote the input and output vectors of \(B^{(k)}\), respectively. Values \(k, \ldots , 2k-1\) in out denote the recursion point targets \(r_1^1, \ldots , r_1^k\) (cf. Sect. 4.1). The setting of in and out is formalized in Eqs. 2–6 when \(\ell =k\).

k-way Recursion Point Edge-Embedding. \(G \in \Gamma _2(n)\) denotes the transformed graph of a Boolean circuit \(C_{u, v}^g\), where \(n=u+v+g\).

  1. Splitting \(G \in \Gamma _2(n)\) into two \(\Gamma _1(n)\) graphs \(G_1\) and \(G_2\): As in Sect. 4.2, we first split G into two \(\Gamma _1(n)\) graphs \(G_1\) and \(G_2\) with 2-coloring.

  2. Merging a \(\Gamma _1(n)\) graph into a \(\Gamma _k(\lceil \frac{n}{k}-1\rceil )\) graph: \(G_1 = (V, E) \in \Gamma _1(n)\) is merged into a \(\Gamma _k(\lceil \frac{n}{k}-1\rceil )\) graph \(G_{m} = (V^\prime , E^\prime )\) (same for \(G_2\)). To this end, we redefine mapping \(\theta _V\) (cf. Sect. 4.2) that maps node \(v_i \in V\) to node \(v_j \in V^\prime \). In this scenario, k nodes in V build one node in \(V^\prime \), so \(\theta _V(v_i) = v_{\lceil \frac{i}{k}\rceil }\). The mapping of the edges \(\theta _E\) is the same as in the 2-way and 4-way EUG construction, and the edges \((v_i', v_j')\in E'\) with \(j<i\) are removed along with \(v_{\lceil \frac{n}{k}\rceil }\) in the end. \(G_m\) is then a topologically ordered graph in \(\Gamma _k(\lceil \frac{n}{k}-1\rceil )\).

  3. The supergraph for Lipmaa et al.’s k-way EUG construction: The next step of the construction is to split \(G_{m} \in \Gamma _k(\lceil \frac{n}{k}-1\rceil )\) into k \(\Gamma _1(\lceil \frac{n}{k}-1\rceil )\) graphs. This is done with k-coloring: a directed graph \(K=(V, E)\) can be k-colored if k sets \(E_1, \ldots , E_k\subseteq E\) satisfy the following conditions:

    $$\begin{aligned} \textsc {Disjoint}&\quad \forall i, j \in \{1, \ldots , k\}: \quad i \ne j \rightarrow E_i \cap E_j = \emptyset . \end{aligned}$$
    (10)
    $$\begin{aligned} \textsc {EdgeIn}E_i&\quad \forall e \in E: \quad \exists i \in \{1, \ldots , k\}: e \in E_i. \end{aligned}$$
    (11)
    $$\begin{aligned} \textsc {Fanin1}E_i&\quad \forall i \in \{1, \ldots , k\}, \forall e = (v_1, v_2) \in E_i: \nonumber \\&\quad \lnot \exists e^\prime = (v_3, v_4)\in E_i \setminus \{e\}: v_2 = v_4 \vee v_1 = v_3. \end{aligned}$$
    (12)

According to Kőnig’s theorem [38, 48] described in Sect. 2.1, \(\Gamma _k(n)\) graphs can always be k-colored efficiently with a dedicated algorithm. The rest of the supergraph construction and the way it is used for edge-embedding is the same as for the 2-way and 4-way EUG as described in Sect. 4.2.
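The three k-coloring conditions (Eqs. 10 to 12) can be checked mechanically. The following Python sketch is only a validity checker for a given candidate coloring, not the coloring algorithm itself; the function name `is_valid_k_coloring` and the representation of edges as integer pairs are assumptions made for illustration.

```python
from itertools import combinations

Edge = tuple[int, int]  # (source, target); hypothetical representation


def is_valid_k_coloring(edges: set[Edge], color_sets: list[set[Edge]]) -> bool:
    """Check Eqs. 10-12 for candidate sets E_1, ..., E_k of a graph (V, E)."""
    # Disjoint (Eq. 10): no edge appears in two different sets.
    for a, b in combinations(color_sets, 2):
        if a & b:
            return False
    # EdgeIn (Eq. 11): every edge of E lies in some E_i, and E_i is a subset of E.
    if set().union(*color_sets) != edges:
        return False
    # Fanin1 (Eq. 12): within one set, no two edges share a source or a target.
    for subset in color_sets:
        sources = [u for u, _ in subset]
        targets = [v for _, v in subset]
        if len(set(sources)) != len(sources) or len(set(targets)) != len(targets):
            return False
    return True
```

For the three possible edges of a topologically ordered graph on 3 nodes, the sets \(\{(1,2),(2,3)\}\) and \(\{(1,3)\}\) pass this check, whereas placing (1, 2) and (1, 3) in the same set violates Eq. 12.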

k-way Edge-Embedding Algorithm. The edge-embedding algorithm is the same as shown in Listing 1, with \(\ell =k\).

5.2 Potentially More Efficient 3-Way UC

The optimal k value for minimizing the size of the k-way UC was calculated to be 3.147 in [46]. We describe our idea of a 3-way UC. Intuitively, based on an optimization by Valiant [66], this UC should result in the best asymptotic size. The asymptotic size of any k-way UC depends on the size of its modular body block \(B^{(k)}\) (e.g., Fig. 5a or b on p. 14 for the 4-way UC). Once it is determined, the size of the UC is \(\text {size}(U^{(k)}_n(\Gamma _2))=2\cdot \text {size}(U^{(k)}_n(\Gamma _1))\sim 2\cdot \frac{\text {size}(B^{(k)})}{k} n\log _k n = 2\cdot \frac{\text {size}(B^{(k)})}{k \log _2(k)} n\log _2 n\). The modular block consists of two permutation networks \(P^{(k)}\), an EUG \(U_k(\Gamma _1)\), and \((k-1)\) Y-switching blocks (cf. Sect. 5.1, [46]).Footnote 2

Size of Body Block \(B^{(3)}\) with Valiant’s Optimized \(U_3(\Gamma _1)\). According to Valiant [66], an EUG \(U_3(\Gamma _1)\) with 3 poles contains only three connected poles (used as recursion base in Sect. 3.1). An optimal permutation network \(P^{(3)}\) that achieves the lower bound has 3 nodes as well. This implies that size\((B^{(3)})=2\cdot \text {size}(P^{(3)})+\text {size}(U_3(\Gamma _1))+(3-1) = 11\). Then, the size of the UC becomes \({\sim }\,2\cdot \frac{11}{3\log _2 3} n\log _2 n \sim 4.627 n \log _2 n\), which is asymptotically around 2.5% smaller than the size of Valiant’s 4-way UC with \({\sim }\,4.75 n\log _2 n\).

However, there is a flaw in this initial design. Valiant’s \(U_3(\Gamma _1)\) only works as an EUG for 3 nodes under special conditions, e.g., when it is a subgraph within a larger EUG. There are 3 possible edges in a topologically ordered graph \(G=(V, E)\) in \(\Gamma _1(3)\): (1, 2), (2, 3) and (1, 3). (1, 2) and (2, 3) can be directly embedded in \(U_3(\Gamma _1)\) using \((p_1, p_2)\) and \((p_2, p_3)\), respectively. (1, 3), however, has to be embedded as a path through node 2, i.e., as a path \(((p_1, p_2), (p_2, p_3))\). When \(U_3(\Gamma _1)\) is a subgraph of a bigger EUG, this is possible by programming \(p_2\) accordingly. However, when we use this \(U_3(\Gamma _1)\) as a building block in the body block of our EUG, it cannot directly be applied, due to the fact that the programming of \(p_2\) depends on other constraints as well. A generic \(U_3(\Gamma _1)\) that can embed (1, 3) without going through \(p_2\) as before has an additional Y-switching block between \(p_2\) and \(p_3\).

We depict in Fig. 7a the 3-way body block that uses Valiant’s optimized \(U_3(\Gamma _1)\) in the k-way block design of [46] and show that it is not a valid body block for an EUG construction. Assume that the output of pole \(p_{3i+1}\) has to be directed to pole \(p_{3i+3}\) (green path). Then, it needs to go through pole \(p_{3i + 2}\), which means that the red edge going to \(p_{3i + 2}\) is used by this path. However, there can be another edge coming from the permutation network as an input to \(p_{3i + 2}\), e.g., from \(p_{3i}\) from the preceding block through \(r_i^1\) (blue path). This cannot be directed to \(p_{3i+2}\) anymore, as shown in Fig. 7a, since the red edge would carry two different values. Therefore, in the 3-way body block construction, it does not suffice to use Valiant’s optimized \(U_3(\Gamma _1)\) [66].

Size of Body Block \(B^{(3)}\) with Our Generic \(U_3(\Gamma _1)\). In Fig. 7b, we show the 3-way body block with the generic \(U_3(\Gamma _1)\) that allows the output from \(p_{3i+1}\) to be directed to \(p_{3i+3}\) without having to go through \(p_{3i+2}\) (green path), and the edge going into \(p_{3i+2}\) can be utilized by the path directed into this node (blue path). This results in size\((B^{(3)}) = 2\cdot \text {size}(P^{(3)}) + \text {size}(U_3(\Gamma _1)) + (3-1) = 12\), which implies that the size of the UC is \({\sim }\,2\cdot \frac{12}{3\log _2 3} n \log _2 n \sim 5.047n\log _2 n\). Unfortunately, this is even worse than the size of the 2-way UC with \({\sim }\,5n\log _2n\), and we therefore conclude that the most efficient known UC is Valiant’s 4-way UC with Zhao et al.’s optimization.
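The leading constants for the two candidate 3-way body blocks follow directly from the formula \(2\cdot \frac{\text {size}(B^{(k)})}{k\log _2 k}\) given in this section. A minimal Python sketch (the function name `uc_leading_constant` is hypothetical) reproduces the arithmetic:

```python
from math import log2


def uc_leading_constant(block_size: int, k: int) -> float:
    """Leading constant c such that size(UC) ~ c * n * log2(n)."""
    return 2 * block_size / (k * log2(k))


# Body block sizes 11 (Valiant's optimized U_3) and 12 (generic U_3), k = 3.
optimized = uc_leading_constant(11, 3)
generic = uc_leading_constant(12, 3)
```

Rounded to three decimals, the two values are 4.627 and 5.047, so only the invalid 11-node block would have beaten the 4-way constant of 4.75.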

Fig. 7 Body block \(B^{(3)}\) construction for our 3-way EUG \(U_n^{(3)}(\Gamma _1)\)

Recently, Zhao et al. [72] have shown by exhaustive search over all possible topologies that the 3-way body block \(B^{(3)}\) presented in Fig. 7b yields the smallest 3-way UC: no block with only 11 additional nodes can be used as a universal block, whereas our block with 12 additional nodes can.

5.3 2/4 Hybrid UC Construction

In this section, we detail our hybrid UC based on Valiant’s 2-way and 4-way UCs with the optimization by Zhao et al. [72], which yields the smallest UCs to date. Given the size of the input circuit \(C_{u, v}^g\), i.e., \(n=u+v+g\), we can calculate at each recursion step if it is better to create 2 subgraphs of size \(\lceil \frac{n}{2}-1\rceil \) and utilize the 2-way recursive skeleton, or it is more beneficial to create a 4-way recursive skeleton with 4 subgraphs of size \(\lceil \frac{n}{4}-1\rceil \).

We assume that for every n, we have an algorithm that computes the size (i.e., size\((U_n^{\text {hybrid}(K)}(\Gamma _1))\)) of the hybrid UC for sizes smaller than n. We give details on how it is computed in Sect. 6. Then, Listing 2 describes the algorithm for constructing a hybrid UC, which chooses at each step the strategy that is more efficient. We note that our hybrid construction is generic, and given multiple k-way UCs as parameter K (\(K=\{2, 4\}\) in our example), it minimizes the concrete size of the resulting UC.

figure c

5.4 Scalable 4-way UC Construction

Our existing implementations of [31, 45] store the whole UC of size \({\mathcal {O}}(n\log n)\) in memory, which therefore becomes a bottleneck when it comes to scalability. In this section, we present the design of our scalable universal circuit construction. Specifically, we show how Valiant’s 4-way UC can be modified to use \({\mathcal {O}}(n)\) memory in the input circuit size n at each step of the execution. We note that our approach is generic, and with additional implementation effort, it can be extended to any k-way UC as well as for the 4-way UC of Zhao et al. [72].

In this section, we present our design that utilizes two separate phases. The first phase is scalable UC generation (Sect. 5.4.1), where the universal circuit is generated given the size n of the input circuit. This is solved by generating the topologically ordered UC layer by layer, each of which has size \({\mathcal {O}}(n)\). The output of this step is a set of circuit files, which all contain a subgraph of size \({\mathcal {O}}(n)\), which helps to significantly reduce the complexity of the second phase, i.e., scalable UC programming (Sect. 5.4.2). In this step, the subcircuits resulting from the first phase are programmed individually, i.e., we proceed subcircuit by subcircuit instead of edge by edge of the input circuit as before. Therefore, the output of this step is a set of programming files that contain the programming bits respective to the circuit files. In Sect. 7.2, we will show experimentally that our scalable UC construction significantly reduces the memory usage.

5.4.1 Scalable Per-Block UC Generation

The underlying idea behind our scalable UC generation is to generate the blocks of the main skeleton one by one, only keeping one such block and its corresponding subgraph nodes in memory at once. In this scenario, these blocks will be regarded as layers. Additionally, we store some necessary information from the preceding three layers in dedicated files, but delete these as soon as they become redundant. The required additional information is the topological order of nodes that are already defined and have edges directed into the current layer. Since the number of subgraphs in any layer is \({\mathcal {O}}(n)\) and each subgraph contributes only a constant number of nodes to a layer, the number of nodes held in memory at any point is \({\mathcal {O}}(n)\) as well.

Fig. 8 Scalable body block construction. a The first part \(B^0=B^0_4\) of the body block, c the second \(B^1=B^1_4\), b the third \(B^2=B^2_4\), and d the fourth \(B^3=B^3_4\), where further subgraphs are created. We note that the nodes are shown only for one of the four subgraphs, but they are the same for all four subgraphs. Scalable head and tail blocks are designed analogously

Our scalable UC generation relies on the fact that at each block of the main skeleton, based on the modulo 4 result for each next recursion step, we know which part of the next subgraph skeleton or potentially recursion base graph we build at each layer. This observation helps us reconstruct what the subgraphs may look like for a given body block in Valiant’s 4-way UC. Since the structure of this is complicated and there are many cases to consider, we show in Fig. 8 the cases for Valiant’s body block from Fig. 5a on p. 14 [66] and note that head and tail blocks can be constructed analogously. Moreover, a similar scalable design can be constructed for Zhao et al.’s body block (Fig. 5b) [72].

Table 1 Files storing the UC in our scalable UC generation for an example with \(n=36\)

Figure 8d shows a recursive block construction with Figs. 8b, c being base cases. From Fig. 8, each body block construction type is denoted by \(B^i\) where \(i\in \{0, 1, 2, 3\}\)Footnote 3 is the position of nodes between two poles in a body block in the subgraph. A given subgraph has node(s) between every two sets of recursion points of the parent graph to which this subgraph belongs. We know that the recursion points, for instance \(\{r^1_1, \ldots , r^1_{\lceil \frac{n-4}{4}\rceil }\}\), are the poles of the next recursion step subgraph. Analogously, we can design head \(H^i\), tail \(T^i_x\), and special last body blocks \(B^i_x\), where \(x\in \{1, 2, 3, 4\}\) denotes the type of the body or tail block based on the number of input or output recursion points, respectively. In the following, we use an example to detail how our scalable UC generation works. We depict the resulting UC files and what their content is in Table 1.

Generation of first (main) skeleton. Generating the first (main) skeleton of the two \(U_n(\Gamma _1)\) EUGs that are merged into a \(U_n(\Gamma _2)\) EUG differs from the subsequent recursive steps. Let us consider an example of a DAG with \(n=u + g + v = 36\). Ideally, our approach constructs twice the same block from the left and right \(U_n(\Gamma _1)\) EUGs. In this scenario for \(U_n(\Gamma _1)\), we have one (merged) head block H, seven (merged) body blocks B, and one (merged) tail block \(T_4\) with 4 nodes in the main skeleton. Constructing the first head block is straightforward according to [31, Fig. 4e] as we do not have to construct any subgraph. Thereafter, we construct seven body blocks according to Fig. 5a and a tail block according to [31, Fig. 4f]. However, these merged blocks require constructing the subgraph nodes in the same layer alongside them, as we describe next. Note that in this first step, we actually generate twice the four sets of subgraph nodes, since the two \(U_n(\Gamma _1)\) EUGs are merged into a \(U_n(\Gamma _2)\) EUG (cf. Sect. 3.1), but in later recursion steps, only four sets of subgraph nodes are generated.

Generating subgraph nodes recursively per layer. We can generate the subgraph nodes recursively for all recursion steps at a given position for nodes n. In our example with \(n=36\), we only have a head and a tail block for the recursion graph with \(\lceil \frac{n-4}{4}\rceil =8\) poles. Therefore, we construct the first body block with \(H^0\) as subgraph level, the second body block with \(H^1\), thereafter \(H^2\) and \(H^3\). The fifth body block is constructed with \(T^0\), the sixth and seventh with \(T^1\) and \(T^2\), respectively, and the tail block with \(T^3\). Recursive scalable blocks are \(H^3\) and \(B^3\) as shown in Fig. 8d. \(T_4^3\) does not have recursion points anymore, since a tail block has no output recursion points. For \(n=8\), we reach a recursion base with \(\lceil \frac{n-4}{4}\rceil =1\). However, for a larger n, more recursion steps might be necessary. Therefore, at each layer, we generate all subgraph nodes necessary, and if a recursion step, i.e., \(H^3\) or \(B^3\), occurs, we generate the nodes of the next subgraph as well, etc. We denote the recursion bases by \(R_1, R_2, R_3\), and \(R_4\) with 1, 2, 3, and 4 nodes, respectively.
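The sequence of subgraph sizes in this example can be traced with a minimal Python sketch. It assumes, as in the example above, that a subgraph with n poles recurses into subgraphs with \(\lceil \frac{n-4}{4}\rceil \) poles and that the recursion stops once at most 4 poles remain (the recursion bases \(R_1, \ldots , R_4\)); the function name `recursion_pole_counts` is hypothetical.

```python
def recursion_pole_counts(n: int, k: int = 4) -> list[int]:
    """Pole counts per recursion level, starting with the main skeleton."""
    counts = [n]
    while counts[-1] > k:                          # assumed: base case at <= k poles
        remaining = counts[-1] - k
        counts.append(-(-remaining // k))          # ceil((n - k) / k)
    return counts
```

For \(n=36\), this yields the levels 36, 8, and 1, matching the 8-pole recursion graph and the recursion base reached from \(n=8\).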

With this, we have shown how to generate topologically ordered universal circuits using the file system and achieve a scalable algorithm for UC generation that stores at most \({\mathcal {O}}(n)\) information in memory. Moreover, our approach requires \(4.75 n\log _2 n\) disk space to store the universal circuit as before, and additionally \({\mathcal {O}}(n)\) extra storage space for every layer. However, we only store additional data for the prior three layers and delete any other stored data at each step. At the end of the UC generation, we can delete any additionally stored data. The maximum storage requirement for our algorithm is reached before deleting the additionally stored data for the last layer, since the size of the UC dominates the storage requirements at any other step (when only a part of it has been generated).

5.4.2 Scalable UC Programming

As described in Sect. 5.4.1, we design our scalable UC generation such that each subgraph is written into a separate file. This is important to also allow the programming step to require only \({\mathcal {O}}(n)\) memory. It can be observed in Listing 1 on p. 17 that the recursion point edge-embedding algorithm inherently handles the UC subgraph by subgraph (cf. Sect. 4.2), which in turn calls the block edge-embedding for all blocks in a subgraph. We observe that each skeleton can be programmed based on the information stored only in the corresponding \(\Gamma _1\) graph, and therefore, we can store the programming bits in a separate file for each subgraph in the same order as the nodes of the subgraph.

After reading a subgraph from its file resulting from the UC generation step detailed in Sect. 5.4.1, it is programmed as described in Listing 1. The embedding starts from the main skeleton in file \(f^0\) and continues with \(f^1, \ldots , f^4\) and \(g^1, \ldots , g^4\), etc., and results in the corresponding programming files \(p^0\), \(p^1, \ldots , p^4\) and \(q^1, \ldots , q^4\), etc.

6 Size and Depth of UCs

In this section, we review the size and depth of the UCs considered in this article. The size of the edge-universal graph \(U_n^{(k)}(\Gamma _1)\) is the number of nodes, counting all the poles and nodes created using Valiant’s construction from Sect. 3.1. The depth of the edge-universal graph is the number of nodes on the longest path between any two nodes, i.e., essentially the path between the first input and last output. \(U_n^{(k)}(\Gamma _2)\) is built from two \(U_n^{(k)}(\Gamma _1)\) edge-universal graphs as described in Sect. 3.1. When transforming \(U_n^{(k)}(\Gamma _2)\) into a UC, the first u poles are associated with inputs, the last v poles with outputs, and the \(g\) poles between are realized with universal gates (cf. Eq. 1 on p. 11) whose programming is defined by the corresponding gates in the simulated circuit. The rest of the nodes of \(U_n^{(k)}(\Gamma _2)\) are translated into universal programmable (X and Y) switching blocks (cf. Fig. 2 on p. 11), whose programming is defined by the edge-embedding of the graph of the circuit G into \(U_n^{(k)}(\Gamma _2)\). Thus, when considering the sizes and depths of the UCs, we realize the nodes and poles as circuit building blocks and express the concrete and asymptotic sizes in the number of switches (X and Y) and universal gates (U) (cf. Sect. 3.2).

In Sect. 6.1, we recapitulate the asymptotic size and depth of Valiant’s 2-way and 4-way UCs [66], i.e., UCValiant-2 and UCValiant-4, respectively, of Zhao et al.’s 4-way UC UCZhao et al.-4 [72] and of the smallest k-way UCs following Lipmaa et al.’s generalization [46]. Thereafter, in Sect. 6.2, we present optimizations that reduce the size (and potentially the depth as well) of UCs, regardless of which constructions were used for their generation. We revise the concrete sizes and depths of UCValiant-2 and UCValiant-4, UCZhao et al.-4 as well as that of our 2/4 hybrid UCs UCH(Valiant-2,4) and UCH(Valiant-2, Zhao et al.-4) (cf. Sect. 5.3).

6.1 Asymptotic Size and Depth of k-Way UCs

Lipmaa et al.’s k-way UC [46] is discussed briefly in Sect. 5.1 and is depicted in Fig. 6 on p. 19. They show that a k-way body block may consist of two permutation networks \(P^{(k)}\), an EUG for k nodes, i.e., \(U_k(\Gamma _1)\), and additionally, \((k-1)\) Y-switching blocks. In this section, we recapitulate the sizes in Table 2 and depths in Table 3 of these building blocks and give an estimate for the leading constant for Lipmaa et al.’s k-way EUGs and UCs with size \({\mathcal {O}}(n\log _2n)\) and depth \({\mathcal {O}}(n)\), for \(k\in \{2, \ldots , 8\}\). We conclude that among all UCs following this generalization, the best size is achieved by Valiant’s 4-way UC, UCValiant-4. This does not exclude the possibility for a more efficient UC, as has been shown in [72], where Zhao et al. propose a 4-way UC, UCZhao et al.-4, using a smaller body block. Therefore, their construction achieves the smallest asymptotic size to date. However, Zhao et al. state that their method cannot be used yet to find more efficient UCs for \(k>4\), since it includes an exhaustive search for which the domain becomes too large.

Table 2 Leading term of the asymptotic \({\mathcal {O}}(n \log _2 n)\) sizes of k-way edge-universal graphs \((U_n^{(k)}(\Gamma _1))\) and universal circuits (UC) and the concrete size of their building blocks for \(k\in \{2, \ldots , 8\}\) according to the design of [46]

6.1.1 Edge-Universal Graph with k Poles

Size. Valiant optimized EUGs up to size 6 by hand in [66]: For \(k=2\), \(U_2(\Gamma _1)\) has two poles; for \(k=3\), we discussed in Sect. 5.2 that an additional node is necessary. For \(k\in \{4, 5, 6\}\), the sizes are \(\{6, 10, 13\}\), as shown in [45, Fig. 1] (the nodes denoted as empty circles disappear in the UC). For \(k=7\) and \(k=8\), we observe that UCValiant-2 results in a better size than that of UCValiant-4 due to the smaller permutation network and fewer recursion nodes. Therefore, we use these constructions to compute the size of \(U_7(\Gamma _1)\) and \(U_8(\Gamma _1)\). As mentioned in [46], another possibility is to use the UC of [44] instead of these EUGs since they have better sizes for small circuits. These UCs \(U_{k}^\text {KS08} \) are built from two smaller \(U_{\frac{k}{2}}^\text {KS08} \), a \(P^{(\frac{k}{2})}\) and \(\frac{k}{2}\) Y switches [44]. It results in a smaller size of 21 for \(k=8\).

Depth. The depth of the hand-optimized EUGs for \(k\in \{2, 3, 4, 5, 6\}\) is, respectively, \(\{2, 4, 5, 7, 10\}\) as shown in [45, Fig. 1]. The depth of \(U_7(\Gamma _1)\) and \(U_8(\Gamma _1)\) becomes, respectively, 16 and 19 with Valiant’s 2-way UC, and 14 and 16 with the UC from [44].

6.1.2 Permutation Networks \(P^{(k)}\)

Size. Waksman in [67] showed that the lower bound for the size of a permutation network is \(\lceil \log _2(k!) \rceil \) for k elements. We show this lower bound in Table 2 as \(P^{(k)}_\text {l}\). The smallest existing permutation network is Waksman’s permutation network \(P^{(k)}_\text {W}\) [7, 67]. For \(k\in \{2, 3, 4\}\), its size matches the lower bound, but for larger values of k, \(P^{(k)}_\text {W}\) uses additional nodes.
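The comparison between the lower bound and Waksman’s network can be reproduced with a minimal Python sketch. The lower bound \(\lceil \log _2(k!) \rceil \) is taken from the text; the closed form \(\sum _{i=1}^{k}\lceil \log _2 i\rceil \) for the size of an arbitrary-size Waksman network is an assumption not stated in this section, and both function names are hypothetical.

```python
from math import factorial


def perm_lower_bound(k: int) -> int:
    """ceil(log2(k!)), computed exactly with integer arithmetic."""
    return (factorial(k) - 1).bit_length()


def waksman_size(k: int) -> int:
    """Assumed switch count of an arbitrary-size Waksman network:
    sum of ceil(log2(i)) for i = 1..k."""
    return sum((i - 1).bit_length() for i in range(1, k + 1))
```

Under this assumption, the two quantities coincide for \(k\in \{2, 3, 4\}\) (1, 3, and 5 nodes) and differ by one node for \(k\in \{5, \ldots , 8\}\).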

Depth. The depth of a permutation network has lower bound \(\lceil \log _2 k \rceil +1\), since each input has to have a path to each output, where switches have only two inputs and two outputs. We show these as the depth of \(P^{(k)}_\text {l}\) in Table 3. Waksman’s permutation network matches the lower bound when \(k\in \{2, 3, 4\}\), but utilizes additional nodes for larger values of k.

6.1.3 Body Blocks

A body block \(B^{(k)}\) is built of \((k-1)\) Y-switching blocks, an EUG for k nodes, and two permutation networks \(P^{(k)}\) [46] (cf. Fig. 6 on p. 19). \(B^{(k)}\) shown in Tables 2 and 3 is built using Waksman’s permutation network \(P_\text {W}^{(k)}\).

Size. The size of the body block is the sum of the sizes of its building blocks, i.e., \(\text {size}(B^{(k)})=\min \left( \text {size}(U_k(\Gamma _1)), \text {size}(U_{k}^\text {KS08})\right) +2\cdot \text {size}(P^{(k)}) + (k-1)\cdot \text {size}(Y).\)

Depth. The depth of \(B^{(k)}\) is the number of edges in its building blocks, the additional edges between the different blocks and the recursion nodes. This means that in total \(\text {depth}(B^{(k)})=\min \left( \text {depth}(U_k(\Gamma _1)),\text {depth}(U_{k}^\text {KS08})\right) +2\cdot \text {depth}(P^{(k)}) + (k-1)\cdot \text {depth}(Y) + 1.\)

6.1.4 Edge-Universal Graphs and Universal Circuits with n Poles

Two k-way EUGs \(U_n^{(k)}(\Gamma _1)\) build up an EUG \(U_n^{(k)}(\Gamma _2)\) as described in Sect. 3.1.

Table 3 Leading terms of the asymptotic \({\mathcal {O}}(n)\) depths of k-way edge-universal graphs \((U_n^{(k)}(\Gamma _1))\) and universal circuits (UC) and the concrete depth of their building blocks for \(k\in \{2, \ldots , 8\}\) according to the design of [46]

Size. The asymptotic size of EUG \(U_n^{(k)}(\Gamma _1)\) is determined as \(\text {size}(U_n^{(k)}(\Gamma _1))\sim \frac{\text {size}(B^{(k)})}{k\log _2 k}n\log _2 n\). The leading factor of size(UC) is twice this number, since asymptotically, the number of switches in the UC is the same as the number of nodes in \(U_n^{(k)}(\Gamma _2)\); the results are summarized in Table 2. We use Waksman’s permutation network \(P^{(k)}_\text {W}\) when calculating the size of the UC. However, even with the lower bound \(P^{(k)}_\text {l}\), for \(k\in \{5, 6, 7, 8\}\) we have the respective leading terms \(\{4.824, 4.900, 5.190, 5\}\), which are larger than 4.75 for \(k=4\). The last column of Table 2 shows that the smallest UC sizes are achieved in order by Zhao et al.’s optimized UC UCZhao et al.-4, Valiant’s 4-way (\(k=4\)) UCValiant-4 and 2-way UCs (\(k=2\)) UCValiant-2.
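The leading terms quoted above can be recomputed from the building-block sizes given in Sects. 5.2 and 6.1. The following Python sketch is an illustration under stated assumptions: the dictionary `EUG_SIZE` collects the EUG sizes mentioned in the text for \(k\in \{2, \ldots , 6\}\), each Y-switching block is counted as one node, and the permutation network size is passed in by the caller. All names are hypothetical.

```python
from math import log2

# EUG sizes for k poles as stated in the text (k = 3 uses the generic U_3).
EUG_SIZE = {2: 2, 3: 4, 4: 6, 5: 10, 6: 13}


def body_block_size(k: int, perm_size: int, y_size: int = 1) -> int:
    """size(B^(k)) = size(U_k) + 2 * size(P^(k)) + (k - 1) * size(Y)."""
    return EUG_SIZE[k] + 2 * perm_size + (k - 1) * y_size


def uc_leading_term(k: int, perm_size: int) -> float:
    """Leading constant c such that size(UC) ~ c * n * log2(n)."""
    return 2 * body_block_size(k, perm_size) / (k * log2(k))
```

With permutation network sizes 1 and 5 for \(k=2\) and \(k=4\), this gives 5 and 4.75; with the lower bounds 7 and 10 for \(k=5\) and \(k=6\), it gives approximately 4.824 and 4.900, in line with the values reported above.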

Depth. The depths of the EUG and of the UC depend only on the depth of the main skeleton, not on the subgraphs, since the longest path is between \(p_1\) and \(p_n\) in the outermost skeleton. Therefore, the asymptotic depths of EUG \(U_n^{(k)}(\Gamma _1)\) and the corresponding UC are calculated as \(\frac{\text {depth}(B^{(k)})}{k}\), as shown in the last column of Table 3. With the lower bound \(P^{(k)}_\text {l}\) for \(k\in \{5, 6, 7, 8\}\), we have the respective leading terms \(\{4, 4, 4.14, 4\}\), which are larger than for \(k=2\) and \(k=4\). The UC depth is minimal for Valiant’s 2-way UCValiant-2  (\(k=2\)), followed by Zhao et al.’s 4-way UC UCZhao et al.-4 and Valiant’s 4-way UCValiant-4  (\(k=4\)) as shown in Table 3.

6.2 Concrete Size and Depth of UCs

In this section, we consider formulae for the concrete sizes and depths of Valiant’s UCs, i.e., UCValiant-2 and UCValiant-4 [66], Zhao et al.’s method UCZhao et al.-4 [72], and our hybrid universal circuits UCH(Valiant-2,4) [31] and UCH(Valiant-2, Zhao et al.-4). Beforehand, we describe two optimizations.

6.2.1 Optimization for Fanin-1 Nodes

We observe that in \(U_n^{(k)}(\Gamma _1)\) there is a fanin-1 node in the head block (cf. [31, Fig. 2c and 4e] for UCValiant-2 and UCValiant-4, respectively). A similarly designed head block for Zhao et al.’s optimized UCZhao et al.-4 [72] has three such fanin-1 nodes (cf. Fig. 19a in “Appendix B”). Moreover, fanin-1 nodes exist in the base cases for a small number of poles as well [45]. These nodes are important to achieve fanin and fanout 2 of the graph, but can be replaced with wires when translated into a circuit description as described in Sect. 3.2. Since at least one such node can be ignored in each subgraph when nodes are translated into gates, this results in at least \(k\cdot \left( \sum _{i = 0}^{\log _k n - 1} k^i \right) = k\cdot \frac{n-1}{k-1}\), i.e., \(\Theta (n)\), fewer gates for the universal circuit, where \(n=u+v+g\). We include this optimization in our calculations further on. This improvement decreases the depth of the UC only by a few gates.

6.2.2 Optimization for Input and Output Nodes

In the skeleton of Valiant’s UC, the poles corresponding to circuit inputs need no ingoing edges and those corresponding to circuit outputs need no outgoing edges. Therefore, since u, v, and \(g\) are publicly known, we optimize by deleting nodes that become redundant while canceling the edges going to the first u (input) and coming from the last v (output) nodes. The exact number of redundant switching nodes depends on the parity or modulo 4 of u, v, and \(n=u+v+g\), and on the k-way UC, but is \({\mathcal {O}}(u + v)\) in both \(\Gamma _1(n)\) edge-universal graphs that build up the graph of the UC. This optimization also improves the depth by \({\mathcal {O}}(u + v)\).

6.2.3 Concrete Sizes and Depths of 4-way and 2-way UCs

We realize that based on the parity (2-way UC) and the remainder modulo 4 (4-way UC), not only the size of the outermost skeleton, but also that of the smaller subgraphs can be optimized by introducing so-called head and tail blocks (cf. Sect. 3.3 and Sect. 3.4). We considered this in our 2-way UC in [45], and we now generalize the approach for k-way UCs. We provide a recursive formula for the concrete size of the optimized k-way EUG as follows. Let \(m_k\) be

$$\begin{aligned} m_k {:}{=} {\left\{ \begin{array}{ll} n \mod k &{}\quad \text {if } k \not \mid n,\\ k &{}\quad \text {if } k \mid n. \end{array}\right. } \end{aligned}$$
(13)

Then, given the designed head, body, and tail blocks (cf. [31, Figs. 2 and 4]) with sizes and depths shown in Table 4, we can compute the size by calculating the sizes of all the components of the outermost skeleton, and the sizes of the smaller subgraphs with the recursive formula in Eq. 14.Footnote 4

$$\begin{aligned} \text {size}(U_n^{(k)}(\Gamma _1))= & {} \text {size}(H^{(k)}(k)) + \left( \left\lceil \frac{n}{k}\right\rceil -3\right) \cdot \text {size}(B^{(k)}(k)) + \text {size}(B^{(k)}(m_k)) \nonumber \\&+\, \text {size}(T^{(k)}(m_k)) +\, m_k \cdot \text {size}\left( U^{(k)}_{\left\lceil \frac{n}{k}-1\right\rceil }(\Gamma _1)\right) \nonumber \\&+ (k-m_k) \cdot \text {size}\left( U^{(k)}_{\left\lfloor \frac{n}{k}-1\right\rfloor }(\Gamma _1)\right) . \end{aligned}$$
(14)

Table 4 The sizes and depths of building blocks of the 2-way and 4-way UCs (cf. Figs. 3–5a, b on p. 12–14, [31, Figs. 2 and 4], Figs. 19a, b in “Appendix B”), including the fanin-1 optimization from Sect. 6.2.1

As described in Sect. 3.1, a UC is constructed by means of an EUG \(U_n^{(k)}(\Gamma _2)\), which is in turn constructed from two EUGs with fanin and fanout one, \(U_n^{(k)}(\Gamma _1)\), by merging their poles together and thus taking them only once into consideration. When constructing a UC for circuit \(C_{u, v}^g\), the number of inputs u, the number of outputs v, and the number of gates \(g\) with fanin and fanout 2 are public. Thus, using Valiant’s construction, \(U_n^{(k)}(\Gamma _2)\) with \(n=u + v + g\) poles is constructed, and thus, our formula for the concrete size of \(U_n^{(k)}(\Gamma _2)\) corresponding to \(C_{u, v}^g\) is

$$\begin{aligned} \text {size}(U_n^{(k)}(\Gamma _2))=2 \cdot \text {size}(U_{n}^{(k)}(\Gamma _1)) - n, \end{aligned}$$
(15)

and the size of the UC is

$$\begin{aligned} \text {size}( UC _{n}) \le (\text {size}(U_n^{(k)}(\Gamma _2)) - n)\cdot \text {size}(X) + g\cdot \text {size}(U), \end{aligned}$$
(16)

where X, Y, and U denote X-, Y-switching blocks and universal gates (cf. Sect. 3.2), respectively, and \(\text {size}(Y)\le \text {size}(X)\le \text {size}(U)\).
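Equations 15 and 16 translate into two one-line computations. The Python sketch below is only an illustration: the function names are hypothetical, and the cost values passed for `size_x` and `size_u` are whatever cost metric the caller chooses for an X-switching block and a universal gate.

```python
def eug_gamma2_size(gamma1_size: int, n: int) -> int:
    """Eq. 15: two Gamma_1 EUGs merged at their n shared poles."""
    return 2 * gamma1_size - n


def uc_size_upper_bound(gamma2_size: int, n: int, g: int,
                        size_x: int, size_u: int) -> int:
    """Eq. 16: non-pole nodes cost at most size(X), the g gate poles cost size(U)."""
    return (gamma2_size - n) * size_x + g * size_u
```

The bound is an upper bound because some non-pole nodes become Y-switching blocks, which are no more expensive than X-switching blocks.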

The depth of a k-way UC also depends on \(m_k\), the head, tail and body blocks (cf. [31, Figs. 2 and 4]), but not on the subgraphs. Thus, it is calculated using the formula in Eq. 17.

$$\begin{aligned} \text {depth}(U^{(k)}_n(\Gamma _1))= & {} \text {depth}(H^{(k)}(k)) + \left( \left\lceil \frac{n}{k}\right\rceil -3\right) \cdot \text {depth}(B^{(k)}(k)) \nonumber \\&+\text {depth}(B^{(k)}(m_k)) + \text {depth}(T^{(k)}(m_k)). \end{aligned}$$
(17)

Since \(\text {depth}(U^{(k)}_n(\Gamma _2))=\text {depth}(U^{(k)}_n(\Gamma _1))\), the depth of the UC is

$$\begin{aligned} \text {depth}(UC_n)\le (\text {depth}(U^{(k)}_n(\Gamma _2))-n)\cdot \text {depth}(X) + g\cdot \text {depth}(U), \end{aligned}$$
(18)

where \(\text {depth}(Y)\le \text {depth}(X)\le \text {depth}(U)\).

6.2.4 Concrete Size and Depth of Our 2/4 Hybrid UC

In Sect. 5.3, we provide a construction for minimizing the concrete size of the resulting 2/4 hybrid UC. The construction chooses at each step the skeleton that results in the smallest size. We provide the formula for determining its size using a dynamic programming algorithm in Eq. 19, where size(\(H^{(k)}{(i)}\)), size(\(T^{(k)}(i)\)) and size(\(B^{(k)}(i)\)) are the values from Table 4 for \(k = 2\) and \(k = 4\). Its depth is the depth of the outermost skeleton, either of the 4-way or 2-way UC, depending on which is chosen first.

$$\begin{aligned} \text {size}(U_n^{\text {hybrid}(K)}(\Gamma _1))= & {} \min \left( \text {size}(H^{(k)}{(k)}) + \left( \left\lceil \frac{n}{k}\right\rceil -3\right) \cdot \text {size}(B^{(k)}(k)) \right. \nonumber \\&\left. +\, \text {size}(B^{(k)}(m_k)) \right. +\, \text {size}(T^{(k)}(m_k))\nonumber \\&+ m_k \cdot \text {size}\left( U^{\text {hybrid}(K)}_{\left\lceil \frac{n}{k}-1\right\rceil }(\Gamma _1)\right) \nonumber \\&+\, (k-m_k) \cdot \text {size}\left( U^{\text {hybrid}(K)}_{\left\lfloor \frac{n}{k}-1\right\rfloor }(\Gamma _1)\right) ;\nonumber \\&\quad \left. k \in K = \{2, 4\}\right) . \end{aligned}$$
(19)
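The recurrences in Eqs. 14 and 19 can be evaluated with a short memoized routine. The following Python sketch is illustrative only: the block sizes in `TOY_BLOCKS` and the base cases in `TOY_BASE` are placeholder numbers and not the values of Table 4, all function names are hypothetical, and we assume \(m_k = n - k(\lceil n/k\rceil - 1) \in \{1, \ldots , k\}\), i.e., the number of poles in the last block. Passing a single k corresponds to Eq. 14, passing \(K=\{2, 4\}\) corresponds to Eq. 19.

```python
from functools import lru_cache

def make_size_fn(blocks, base, ks):
    """Return size(n) for U_n(Gamma_1) following Eq. 14 (one k) or Eq. 19 (several k).

    blocks[k] = {"head": int, "body": {i: int}, "tail": {i: int}} holds the block
    sizes (to be taken from Table 4); base[n] holds the sizes of the recursion
    base cases. Both are supplied by the caller.
    """
    @lru_cache(maxsize=None)
    def size(n):
        if n in base:
            return base[n]
        candidates = []
        for k in ks:
            q = -(-n // k)            # ceil(n / k): number of blocks
            if q < 3:                 # the skeleton needs ceil(n/k) - 3 >= 0 body blocks
                continue
            m = n - k * (q - 1)       # assumed definition of m_k, in {1, ..., k}
            b = blocks[k]
            skeleton = (b["head"] + (q - 3) * b["body"][k]
                        + b["body"][m] + b["tail"][m])
            # m_k subgraphs with ceil(n/k - 1) poles, k - m_k with floor(n/k - 1) poles
            subgraphs = m * size(q - 1) + (k - m) * size(n // k - 1)
            candidates.append(skeleton + subgraphs)
        if not candidates:
            raise ValueError(f"no base case or skeleton available for n={n}")
        return min(candidates)
    return size

def gamma2_size(size_gamma1, n):
    """Eq. 15: two merged Gamma_1 EUGs share their n poles."""
    return 2 * size_gamma1(n) - n

# Placeholder numbers for illustration only (NOT the values of Table 4).
TOY_BLOCKS = {
    2: {"head": 3, "body": {1: 4, 2: 5}, "tail": {1: 2, 2: 4}},
    4: {"head": 10, "body": {1: 13, 2: 15, 3: 17, 4: 19},
        "tail": {1: 4, 2: 8, 3: 11, 4: 14}},
}
TOY_BASE = {n: 2 * n for n in range(1, 9)}

size_2way = make_size_fn(TOY_BLOCKS, TOY_BASE, (2,))
size_4way = make_size_fn(TOY_BLOCKS, TOY_BASE, (4,))
size_hybrid = make_size_fn(TOY_BLOCKS, TOY_BASE, (2, 4))
```

With identical block sizes and base cases, the hybrid value can never exceed that of either pure variant, since the minimum is taken over both skeletons at every recursion step.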

6.2.5 Improvements in Size over Valiant’s 2-way UC

Figure 9 shows the concrete improvement in percentage of UCValiant-4 and UCZhao et al.-4 over UCValiant-2 up to ten million nodes in the simulated input circuit. All reported averages are for the interval \(n \in \{15, \ldots , 10^7\}\). From the asymptotic leading factors in Table 2, we expect an improvement of up to \(5\%\) for UCValiant-4 and up to \(10\%\) for UCZhao et al.-4. In Table 5, we depict the minimum, average, and maximum improvement compared to the asymptotic improvement in the interval \(n \in \{2, \ldots , 10^7\}\). For the smallest n values (\(n\le 15\)), UCValiant-2 is better than both 4-way UCs. With growing values of n, the 4-way UCs become better, except for some short intervals as shown in Fig. 9. Valiant’s and Zhao et al.’s 4-way UCs always outperform Valiant’s 2-way UC for \(n\ge {10\,996}\) and \(n\ge {172}\), respectively, with an average improvement of 2.97% and 7.65% and a maximum improvement of 3.78% and 8.88%.

Fig. 9

Improvement in size in percentage of our 2/4 hybrid, the 4-way UCs of [66, 72] over Valiant’s 2-way UC for \(15 \le n\le 10^7\) with logarithmic x axis. We note that the different graphs are in the same order as in the legend

The improvement of our UCH(Valiant-2,4) and UCH(Valiant-2, Zhao et al.-4) (cf. Sect. 5.3) is depicted in the same Fig. 9 and summarized in Table 5. For some n values, our hybrid UCs achieve the same size as the 2-way or corresponding 4-way UCs, but due to their nature, their improvement is always nonnegative, and greater than or equal to the improvement achieved by the 4-way UC. Moreover, in most cases our hybrid UCs result in better sizes than the underlying 4-way UC, which means that some subgraphs are created for an n for which the 2-way UC is smaller. The overall improvement over UCValiant-2 for all \(n \in \{2, \ldots , 10^7\}\) values of our UCH(Valiant-2,4) is on average 3.41% and at most 4.00%, and for our UCH(Valiant-2, Zhao et al.-4) is on average 7.71% and at most 8.88%.

We note that our hybrid UC can also be used to reduce the depth of the UC by utilizing the 2-way UC, UCValiant-2, in the first step of the construction. This results in the smallest asymptotic depth \({\sim }\,3n\) (cf. Table 3).

Table 5 Minimum, average, maximum, and expected asymptotic improvement in size of our 2/4 hybrid and the 4-way UCs of [66, 72] over Valiant’s 2-way UC in the range \(15 \le n \le 10^7\)

7 Implementation and Evaluation of Our UC Compiler

In this section, we detail the challenges faced while demonstrating the practicality of Valiant’s and Zhao et al.’s universal circuits. We show how to construct a UC and program it according to a standard circuit description. We validate our results with a practical implementation that, upon receiving a fanin-2 circuit \(C_{u, v}^{{\tilde{g}}}\) as input, outputs the corresponding 2-way or 4-way UC UCValiant-2, UCValiant-4 or UCZhao et al.-4 and its programming \(c^f\). We have provided the first implementation of Valiant’s 2-way UC of size \({\sim }\,5 n \log _2 n\) in [45] and implemented Valiant’s 4-way UC of smaller size \({\sim }\,4.75 n\log _2n\) in a modular way in [31].

In this work, we extend our implementation with the modular 2-way UC and include the optimized 4-way UC of Zhao et al. [72] with size \({\sim }\,4.5 n\log _2n\). We then combine the modular 2-way UC with both 4-way UCs in an implementation of our hybrid UC proposed in [31] and Sect. 5.3, i.e., UCH(Valiant-2,4) and UCH(Valiant-2, Zhao et al.-4), respectively. Moreover, we provide a prototype implementation of our scalable 4-way UC from Sect. 5.4, which can be generalized to both the 2-way UC and Zhao et al.’s improvement.

Fig. 10

Our universal circuit compiler

7.1 UC Compiler

The architecture of our UC compiler is depicted in Fig. 10. In this section, we briefly describe its different artifacts and its use of the Fairplay [51] or CBMC-GC [10, 23] frameworks as a frontend. For a more detailed description, the reader is referred to [45]. Our implementation is available online at https://encrypto.de/code/UC.

  1. Compiling Input Circuits from High-Level Functionality. We can use the Fairplay compiler [11, 51] with the FairplayPF extension [44] or the CBMC-GC compiler [10, 23] to translate the functionality described in a high-level language to the Fairplay circuit description called Secure Hardware Definition Language (SHDL). These compilers output a circuit \(C_{u, v}^{{\tilde{g}}}\) with fanin 2, which is required for all UCs. However, due to Valiant’s design, the input circuit \(C_{u, v}^g\) to our UC compiler has to have fanout 2 as well, i.e., the outputs of all gates and inputs can only be used as the input of at most two subsequent gates. This can be achieved using copy gates such that instead of \({\tilde{g}}\) gates, we have \({\tilde{g}}\le g\le 2{\tilde{g}}+ v\) fanout-2 gates (cf. Sect. 2.2). We give concrete examples in [45] on how this conversion affects the size of practical circuits and show that in most cases, the resulting number of gates remains significantly below the upper bound \(2{\tilde{g}}+ v\).

  2. Obtaining the \(\Gamma _2(n)\) Graph G of the Circuit \(C_{u, v}^g\). As the next step, we transform circuit \(C_{u, v}^g\) into a \(\Gamma _2(n)\) graph \(G = (V, E)\) with \(n=u+v+g\) (cf. Sect. 3.1). This can directly be generated as described in Sect. 2.2: With the number of inputs u, outputs v, and gates \(g\) in circuit \(C_{u, v}^g\), G has n nodes and the wires are represented as edges in the graph. Then, we define a topological order \(\eta ^G\) on the nodes of G such that every input node \(v_i\) has a topological order of \(1 \le \eta ^G(v_i) \le u\) and every output node \(v_j\) is labeled with \(u+g+1\le \eta ^G(v_j) \le u + v + g\). Since \(C_{u, v}^g\) has fanin and fanout 2, the resulting graph G is in \(\Gamma _2(n)\), where \(n=u+v+g\). It is possible in the modified SHDL circuit description that an internal value is used twice as the first input or twice as the second input of gates. Therefore, when a value is used for the second time as the same input of a gate (i.e., first or second), both the two inputs and the two middle bits of the function table of that gate must be swapped (i.e., to compute \(f(\text {in}_1, \text {in}_2)\) instead of \(f(\text {in}_2, \text {in}_1)\)) for the correct programming of the UC in Step 5.

  3. Generating Edge-Universal Graph \(U_n^{(\ell )}(\Gamma _2)\) or \(U_n^{\text {hybrid}(K)}(\Gamma _2)\) for \(\Gamma _2(n)\) graphs, where \(\ell \in \{2, 4\}\) and \(K=\{2, 4\}\). An EUG \(U_n^{(\ell )}(\Gamma _2)\) or \(U_n^{\text {hybrid}(K)}(\Gamma _2)\) is constructed by creating two instances of \(U_n^{(\ell )}(\Gamma _1)\) or \(U_n^{\text {hybrid}(K)}(\Gamma _1)\), respectively, as described in Sect. 3.1. The two instances are merged into \(U_n^{(\ell )}(\Gamma _2)\) so that one builds the left inputs and outputs and the other builds the right inputs and outputs of the gates (based on the two-coloring of G). For efficiency reasons, we directly generate the merged edge-universal graph, i.e., an EUG for \(\Gamma _2(n)\), with the poles as common nodes. We partly include our optimization for the input and output nodes from Sect. 6.2.2 and Valiant’s optimizations for the base cases \(n\in \{2, 3, 4\}\), but do not consider Valiant’s optimizations for \(n \in \{5, 6\}\) [66]. Knowing the number of input bits u, the number of gates \(g\), and the number of output bits v, we construct the corresponding edge-universal graph \(U_n^{(\ell )}(\Gamma _2)\), where \(n=u+v+g\). We note that no knowledge is necessary about the topology or the gate tables in circuit C for this step.

  4. Programming \(U_n^{(\ell )}(\Gamma _2)\) and \(U_n^{\text {hybrid}(K)}(\Gamma _2)\) According to an Arbitrary \(\Gamma _2(n)\) Graph. We edge-embed graph G into \(U_n^{(\ell )}(\Gamma _2)\) as described in Sect. 4 and into our hybrid \(U_n^{\text {hybrid}(K)}(\Gamma _2)\) with \(K=\{2, 4\}\) as described in Sect. 5.3. G is partitioned into two \(\Gamma _1(n)\) graphs \(G_1\) and \(G_2\) which are embedded into the two EUGs \(U_n^{(\ell )}(\Gamma _1)_1\) and \(U_n^{(\ell )}(\Gamma _1)_2\). Valiant proved in [66] that any topologically ordered \(\Gamma _1(n)\) graph can be edge-embedded in an EUG \(U_n^{(\ell )}(\Gamma _1)\) (cf. Sect. 3.1). We perform the embedding as described in Sect. 4 for Valiant’s 2-way and 4-way EUGs in Listing 1. The difference when using Zhao et al.’s improvement [72] is the block edge-embedding described in Sect. 4.1. Here, we utilize a lookup table derived from the computer-generated proof of Zhao et al. [72] that maps the in and out vectors as defined in Sect. 4.1 to the programming bits of the block, i.e., it can be used as block edge-embedding along with the recursion point edge-embedding described in Sect. 4.2. We edge-embed \(G_1\) and \(G_2\) into our 2/4-hybrid EUGs \(U_n^{\text {hybrid}(K)}(\Gamma _1)_1\) and \(U_n^{\text {hybrid}(K)}(\Gamma _1)_2\) as described in Sect. 5.3. When the edge-embedding is finished, we define the control bits of the programmable blocks (universal gates and switches) as described in Sect. 3.2.

  5. Generating the Output Circuit Description and the Programming of the Universal Circuit. After embedding the graph of the simulated circuit into the edge-universal graph \(U_n(\Gamma _2)\), we write the resulting circuit to a file using our generic UC description. In the edge-universal graph, each node stores the control bit resulting from the edge-embedding (control bit c of the corresponding universal switch in Sect. 3.2) and each pole corresponding to a gate stores four bits (the four control bits of the function table of the corresponding gate in the original circuit \(C_{u, v}^g\), \(c_0, c_1, c_2, c_3\) in Eq. 1, their order possibly changed in Step 2). Thus, after topologically ordering \(U_n(\Gamma _2)\), one can directly write out the gate identifiers into a circuit file UC and the control bits to a programming file \(c^f\). We include our optimization from Sect. 6.2.1 and ignore extra nodes with fanin 1 when the graph is translated into a UC description. This improves the size of the recursion bases for \(n\in \{4, 5, 6\}\) as well as of the head blocks [31, Fig. 2c and Fig. 4e] and Fig. 19a in “Appendix B.”
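The partition of G into the two \(\Gamma _1(n)\) graphs \(G_1\) and \(G_2\) used in Steps 3 and 4 can be illustrated with a generic edge two-coloring. The Python sketch below uses the standard alternating-walk technique on a graph with fanin and fanout at most 2; it is not a description of the compiler's actual code, and the function names and the edge-list representation are assumptions made for illustration.

```python
from collections import defaultdict

def two_color_edges(edges):
    """Color the edges of a fanin/fanout-2 graph with 0 or 1 such that every
    node has at most one incoming and one outgoing edge of each color."""
    incident = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        incident[("out", u)].append(idx)   # edge leaves u
        incident[("in", v)].append(idx)    # edge enters v
    if any(len(ids) > 2 for ids in incident.values()):
        raise ValueError("graph is not in Gamma_2: fanin or fanout exceeds 2")

    def ends(idx):
        u, v = edges[idx]
        return ("out", u), ("in", v)

    color = [None] * len(edges)
    for start in range(len(edges)):
        if color[start] is not None:
            continue
        color[start] = 0
        # Walk away from the start edge in both directions, alternating colors.
        for side in ends(start):
            cur, node = start, side
            while True:
                others = [e for e in incident[node] if e != cur]
                if not others or color[others[0]] is not None:
                    break
                nxt = others[0]
                color[nxt] = 1 - color[cur]
                a, b = ends(nxt)
                node = b if a == node else a
                cur = nxt
    return color

def is_gamma1(edges):
    """True if every node has at most one outgoing and one incoming edge."""
    outs = [u for u, _ in edges]
    ins = [v for _, v in edges]
    return len(set(outs)) == len(outs) and len(set(ins)) == len(ins)

# Toy Gamma_2 graph: nodes 0, 1 act as inputs, 2, 3 as gates, 4, 5 as outputs.
example = [(0, 2), (1, 2), (0, 3), (1, 3), (2, 4), (3, 4), (2, 5), (3, 5)]
colors = two_color_edges(example)
g1 = [e for e, c in zip(example, colors) if c == 0]
g2 = [e for e, c in zip(example, colors) if c == 1]
```

Each connected component of the auxiliary bipartite graph is a path or an even cycle, which is why alternating the colors along a walk always succeeds.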

Our circuit description format is generic, i.e., it consists of universal switches and universal gates. Therefore, any framework can be adapted to use it, regardless of whether the UC is interpreted as a Boolean or an arithmetic UC. We start with enumerating the client input wires as \(C \quad 0 \quad 1 \ \ldots \ u-1\). As a reminder, the \({\mathcal {O}}(n\log n)\) server input wires are in the programming file \(c^f\). In the UC, we have universal gates denoted by U, universal switches denoted by X or Y depending on the number of outputs (X with two outputs and Y with one):

$$\begin{aligned} U&\qquad \text {in}_1\quad \text {in}_2\quad \text {out}_1 \end{aligned}$$
(20)
$$\begin{aligned} X&\qquad \text {in}_1\quad \text {in}_2\quad \text {out}_1 \quad \text {out}_2 \end{aligned}$$
(21)
$$\begin{aligned} Y&\qquad \text {in}_1\quad \text {in}_2\quad \text {out}_1 \end{aligned}$$
(22)

denotes that wire \(\text {out}_1\) (and possibly \(\text {out}_2\)) is coming from a gate with input wires \(\text {in}_1\) and \(\text {in}_2\). The control bits are not represented in the circuit format, but for each universal gate we save a four-bit number representing the control bits and for each universal switch we store the control bit in the programming file \(c^f\). The output nodes are outputs of Y universal switches and are marked in the end of the file as \(O \quad \text {o}_1 \quad \text {o}_2 \quad \ldots \quad \text {o}_v\). The circuit and its programming are given in plain text files as shown in Listings 3 and 4 in “Appendix C.”
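To illustrate how such a description could be consumed, the following Python sketch evaluates a UC given in the line format above. Several details are assumptions and may differ from the actual files shown in Listings 3 and 4: tokens are whitespace-separated, the programming is passed as a list with one entry per U, X, or Y line in file order, the four-bit number of a universal gate stores \(c_0\) as its most significant bit, and the block semantics follow the usual convention that a Y switch outputs \(\text {in}_1\) for \(c=0\) and \(\text {in}_2\) for \(c=1\), an X switch swaps its inputs for \(c=1\), and a universal gate outputs \(c_{2\text {in}_1+\text {in}_2}\).

```python
def evaluate_uc(circuit_text, programming, inputs):
    """Evaluate a UC description; returns the list of output bits."""
    wires = {}
    prog = iter(programming)
    outputs = []
    for line in circuit_text.strip().splitlines():
        tokens = line.split()
        if not tokens:
            continue
        kind, args = tokens[0], [int(t) for t in tokens[1:]]
        if kind == "C":                       # client input wires
            if len(args) != len(inputs):
                raise ValueError("wrong number of input bits")
            wires.update(zip(args, inputs))
        elif kind == "U":                     # universal gate
            in1, in2, out = args
            table = next(prog)                # c0 c1 c2 c3, c0 = MSB (assumed)
            index = 2 * wires[in1] + wires[in2]
            wires[out] = (table >> (3 - index)) & 1
        elif kind == "X":                     # switch with two outputs
            in1, in2, out1, out2 = args
            c = next(prog)
            a, b = wires[in1], wires[in2]
            wires[out1], wires[out2] = (b, a) if c else (a, b)
        elif kind == "Y":                     # switch with one output
            in1, in2, out = args
            c = next(prog)
            wires[out] = wires[in2] if c else wires[in1]
        elif kind == "O":                     # output wires
            outputs = [wires[w] for w in args]
        else:
            raise ValueError(f"unknown line type {kind!r}")
    return outputs

# Hypothetical toy description (not generated by the UC compiler).
TOY_UC = """
C 0 1
X 0 1 2 3
U 2 3 4
Y 4 0 5
O 5
"""
```

For example, `evaluate_uc(TOY_UC, [0, 0b0001, 0], [a, b])` computes AND and `evaluate_uc(TOY_UC, [0, 0b0110, 0], [a, b])` computes XOR of the two input bits with the same circuit file, which mirrors how the programming \(c^f\) alone selects the function.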

Fig. 11

Improvement in percentage of the UC sizes (number of switches) of our UC implementation of Valiant’s 4-way UCValiant-4 from [31] and our novel implementations including a modular version of Valiant’s 2-way UCValiant-2, Zhao et al.’s improved block UCZhao et al.-4 and hybrid constructions UCH(Valiant-2,4) and UCH(Valiant-2, Zhao et al.-4) over our implementation of Valiant’s 2-way UC from [45]

7.2 Experimental Evaluation

We ran all experiments for our UC compiler on a Desktop PC, equipped with an Intel Core i7-4790 CPU with 3.6 GHz and 32 GB RAM, and provide our results in this section. We performed experiments for circuit sizes \(n \in \{10, 100, \ldots , {1\,000\,000}\}\) as well as with notable circuits from [65] such as the AES-128 circuit without key expansion with size \(n={38\,518}\) and the SHA-256 circuit with size \(n={201\,206}\). We note that these sizes are for the circuits transformed to have fanin and fanout 2 as described in Sect. 2.2 and in [45, Table 1].

Circuit Sizes (Fig. 11). We first compare the circuit sizes of our implementations, which slightly differ from the expected sizes shown in Sect. 6. Our initial 2-way UCValiant-2 implementation from [45] included the recursion bases for 1, 2, and 3 nodes, but did not include those proposed by Valiant [66] optimized for 4, 5, and 6 nodes. It included both size optimizations described in Sects. 6.2.1 and 6.2.2. In Fig. 11, we show the improvement over our UCValiant-2 implementation from [45] in percentage of the number of switches of our later, more modular UC implementations presented in this article and in [31]. We note that the number of universal gates is the same for all implementations, i.e., the number of gates g in the original circuit.

Our modular 4-way UCValiant-4 implementation from [31] additionally included the recursion base with 4 nodes, but only partly included the optimization described in Sect. 6.2.2 concerning the input and output nodes. The edges directed into the inputs and out of the outputs are removed, which results in smaller sizes because the affected nodes become redundant; however, not all unnecessary connections are deleted. This incurs only a small overhead of at most \({\mathcal {O}}(u + v)\). As we can observe in Fig. 11 and as expected (cf. Table 5 on p. 32), this implementation improved by around 5% over our implementation from [45].

In this article, we have first implemented the modular version of Valiant’s 2-way UCValiant-2 where inherently we use the optimized recursion base with 4 nodes as well. An improvement of around 1.5-2% can be observed over our non-modular implementation from [45]. Using this and our modular 4-way UCValiant-4, we have implemented our hybrid UCH(Valiant-2,4) using Valiant’s 2-way and 4-way UCs as proposed in [31]. This implementation has a steadier improvement of at least 5% for most tested circuit sizes. Moreover, we also implemented the optimized UCZhao et al.-4 proposed in [72], whose authors proved that their optimized block is universal by giving the programming for all possible path combinations in the block. We use this proof to generate a lookup table file for our implementation, which contains a mapping from any possible input–output vector (cf. Sect. 4.1) to the corresponding programming bits for the block. The generation of this lookup table is a one-time precomputation cost and takes around 82 seconds. In subsequent runs of the UC compiler, this overhead is no longer incurred and a file of size 1.08 MB is read, which takes only about 80 milliseconds. Thereafter, the expected gain of around 10% can be observed over our 2-way UCValiant-2 implementation from [45]. Moreover, the hybrid variant with this construction, i.e., UCH(Valiant-2, Zhao et al.-4), achieves an improvement of at least 10% for all our example circuits.

Fig. 12

Comparison of the runtime of our hybrid UC implementations using either Valiant’s 2-way and 4-way UCs or Valiant’s 2-way UC with Zhao et al.’s improved block. We note that the runtime of UCZhao et al.-4 only slightly differs from that of UCH(Valiant-2, Zhao et al.-4), and the runtimes of UCValiant-2 and UCValiant-4 only slightly differ from that of UCH(Valiant-2,4), and therefore, we omit them from the figure

In Table 6, we show the concrete number of switches of the smallest UCs generated with UCH(Valiant-2, Zhao et al.-4) as well as the sizes of the resulting UC and programming files. The universal circuit for \(n=1\) million gates has around 76 million switches and additionally around 1 million universal gates (which, in the PFE setting, results in a total of about 77 million AND gates for Yao’s garbled circuit protocol and 79 million AND gates for the GMW protocol). The corresponding file for the UC has size 2.8 GB, and the programming file has size 0.15 GB.
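The AND-gate totals quoted above follow from the per-block AND sizes given in Sect. 8.1: one AND gate per X or Y switch, and one AND gate per universal gate with Yao’s protocol or three with the GMW protocol. A minimal Python check with the rounded figures from the text (the function and constant names are chosen here for illustration):

```python
AND_PER_SWITCH = 1
AND_PER_UNIVERSAL_GATE = {"yao": 1, "gmw": 3}

def and_size(num_switches, num_universal_gates, protocol):
    """Total number of AND gates of a UC for the given SFE protocol."""
    return (num_switches * AND_PER_SWITCH
            + num_universal_gates * AND_PER_UNIVERSAL_GATE[protocol])

# Rounded values for the circuit with one million gates.
switches, universal_gates = 76_000_000, 1_000_000
yao_total = and_size(switches, universal_gates, "yao")   # 77 million
gmw_total = and_size(switches, universal_gates, "gmw")   # 79 million
```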

Table 6 Size of our smallest UCs generated with UCH(Valiant-2, Zhao et al.-4), i.e., the number of switches and the sizes of the UC and programming files

Runtime (Fig. 12). To compare the runtime of our UC implementation with that of the UC compiler of [45], we ran the same experiments on the same platform using our novel implementations for UCValiant-2, UCZhao et al.-4, UCH(Valiant-2,4), and UCH(Valiant-2, Zhao et al.-4). Runtimes are reported as averages from 10 executions. The differences in runtimes for the different constructions are not significant, and therefore, we only depict the runtimes of our hybrid implementations UCH(Valiant-2,4) and UCH(Valiant-2, Zhao et al.-4) in Fig. 12.

The runtimes of our modular UCValiant-2 and UCValiant-4 implementations are very similar to those of UCH(Valiant-2,4), the latter of which becomes best for larger circuits (i.e., our examples with \(n\ge {10\,000}\)). The runtimes of UCZhao et al.-4 are only slightly lower than those of our hybrid UCH(Valiant-2, Zhao et al.-4), both of which include a one-time overhead of around 80 milliseconds for reading in our lookup table of size 1.08 MB for each possible block programming [72]. However, this one-time expense is only significant for small circuits as can be observed in Fig. 12, and UCH(Valiant-2, Zhao et al.-4) becomes faster than UCH(Valiant-2,4) for our examples with \(n\ge {10\,000}\). The runtime of our original 2-way UCValiant-2 from [45] was slightly better due to its handling of the UC as one big block. However, it also becomes worse than UCH(Valiant-2, Zhao et al.-4) for our largest examples SHA-256 and the circuit for one million gates due to the gain in the size that results in a less complex embedding. For instance, it takes about 12 s to generate the smallest UCH(Valiant-2, Zhao et al.-4) with our new implementation for AES-128, while our original implementation for UCValiant-2 took 9.4 s. Our largest examples SHA-256 and a circuit with one million gates were generated and programmed in 2.1 and 18.6 min, respectively. The runtimes are high for these large examples; however, they are generally a one-time precomputation expense in most application scenarios such as private function evaluation (cf. Sect. 1.1).

Scalable 4-way UC Implementation (Figs. 13, 14). We also implemented our scalable 4-way UC generation algorithm presented in Sect. 5.4. We note that our implementation only includes \(H^i\), \(T^i_x\) and \(B_x^i\) for \(i\in \{0, 1, 2, 3\}\) and \(x = 4\) and does not include the optimized versions for \(x \in \{1, 2, 3\}\), which we leave as future work. Moreover, we include the base cases for \(n\in \{1, 2, 3\}\) but not that for \(n=4\). This is because a lot of engineering effort would be required for including the other options as well, and our work is only a proof-of-concept implementation of our method presented in Sect. 5.4. Therefore, we test circuits with specific sizes where none of the other blocks or base cases are required, i.e., where all subgraphs at each recursion step have 4 nodes in the tail block and the base case with \(n=4\) is not needed. Currently, for generating UCs for different sizes, one would need to pad the original circuit with dummy gates to an allowed size. Our aim was to improve the memory consumption of the UC generation (and programming) algorithm, while keeping the price paid in runtime as low as possible. The number of files created is the number of subgraphs in the UC, which is necessary for efficient scalable programming of the UC.

Fig. 13

Comparison of the maximum memory used between our per-block and [31]’s UC generation. [31]’s implementation runs out of 32 GB of memory for \(n>{1\,398\,100}\) nodes

Fig. 14

Comparison of the runtime of our per-block and [31]’s UC generations for up to about \(n={2\,446\,000}\) nodes, which fails with [31]’s UC generation and 32 GB of memory

We show that our scalable UC generation implementation provides the expected improvement in memory usage by comparing our scalable UCValiant-4 implementation to our implementation from [31]. We depict in Fig. 13 the memory usage of the generation algorithm with growing input circuit sizes on a machine with 32 GB of RAM. As can be seen in the figure, instead of holding the whole UC of size \({\mathcal {O}}(n\log n)\) in memory, we indeed hold only \({\mathcal {O}}(n)\) information in memory at each step. When using 1 GB, 8 GB, and 32 GB of memory, we can generate a UC for over \(27\times \), \(28\times \), and \(29\times \) larger input circuit sizes n, respectively. Moreover, as can be observed in Fig. 14, the runtime of the resulting scalable UC generation is only around \(4\times \) that of the UCValiant-4 implementation of [31]. This difference becomes smaller with increasing n because the implementation of [31] runs short on memory and starts swapping to disk. Our experiments show that while reducing the memory requirements of our UC generation for UCValiant-4, we keep the runtime asymptotically the same (cf. Fig. 14). Moreover, the required storage capacity is also \({\mathcal {O}}(n\log n)\) as before, since the additionally stored data at each step are at most \({\mathcal {O}}(n)\), cf. Sect. 5.4.

8 Toolchain for Private Function Evaluation

Secure function evaluation (SFE) allows two parties to jointly compute a public function on their private inputs, without revealing anything to each other apart from the output of the computation. As it is probably the most prominent application of UCs (cf. Sect. 1.1), we implement private function evaluation (PFE) using SFE of a Boolean universal circuit. In this scenario, one of the parties holds its input x and the other party holds the programming \(c^f\) corresponding to a private function f that allows the UC to compute \(UC(x, c^f)=f(x)\). We note that the UC (with control bits for the universal gates and switches) can be publicly generated.

We have created a novel toolchain for private function evaluation (PFE) in [45], using the ABY framework for SFE (secure against semi-honest adversaries) as backend of our UC compiler. ABY implements state-of-the-art optimizations of Yao’s garbled circuit protocol [69, 70] and the GMW protocol [32]. We emphasize that our tool for constructing and programming UCs is generic and can easily be adapted to other secure computation frameworks or other applications of UCs listed in Sect. 1.1.

8.1 Extension of the ABY Framework

We adapt the ABY secure two-party computation framework [19] for securely evaluating universal circuits. We realize the universal circuit building blocks (universal gates and switches) with a number of AND and XOR gates, which is the functionally complete set of logical gates that ABY uses. Since XOR gates can be evaluated for free in the underlying protocols for secure function evaluation due to the free-XOR optimization [43], from here on, we study the AND-size (\(\text {size}^{\text {AND}}\)) and AND-depth (\(\text {depth}^{\text {AND}}\)) of UCs, i.e., the number of AND gates and the maximum number of AND gates on the longest path, respectively. For other applications, however, the total sizes and depths of the UCs with respect to both AND and XOR gates are relevant. We implement universal gates and switches optimized for PFE; they therefore use few AND gates and otherwise only (free) XOR gates. X and Y gates are obtained as shown in [43]:

$$\begin{aligned} \text {out}_1&= Y(\text {in}_1, \text {in}_2; c) = (\text {in}_1\oplus \text {in}_2)c \oplus \text {in}_1 \end{aligned}$$
(23)
$$\begin{aligned} (\text {out}_1, \text {out}_2)&= X(\text {in}_1, \text {in}_2; c) = (e \oplus \text {in}_1, e \oplus \text {in}_2) \text { with } e = (\text {in}_1 \oplus \text {in}_2)c \end{aligned}$$
(24)

with \(\text {size}^{\text {AND}}(Y)=\text {size}^{\text {AND}}(X)=\text {depth}^{\text {AND}}(Y)=\text {depth}^{\text {AND}}(X)=1\) for both universal switches. In case the SFE implementation uses Yao’s garbled circuit protocol [70], both \(\text {size}^{\text {AND}}(U)=1\) and \(\text {depth}^{\text {AND}}(U)=1\), due to the fact that in some garbling schemes (such as in the case of garbled 3-row reduction (GRR3) [55]) the evaluator does not learn the type of the evaluated gate. Therefore, a universal gate can be implemented using only one 2-input non-XOR gate [60]. For other SFE protocols such as GMW where this optimization is not possible, our efficient implementation of generic universal gates uses Y gates yielding

$$\begin{aligned} \text {out}_1 = U(\text {in}_1, \text {in}_2; c_0, c_1, c_2, c_3) = Y[Y(c_0, c_1; \text {in}_2), Y(c_2, c_3; \text {in}_2); \text {in}_1] \end{aligned}$$
(25)

with \(\text {size}^{\text {AND}}(U)=3\) and \(\text {depth}^{\text {AND}}(U)=2\). We note that the implementation of switches and universal gates might look very different when other 2-input Boolean gates can also be used, e.g., when other size metrics are to be minimized.
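The formulas in Eqs. 23 to 25 can be checked exhaustively on single bits. The Python sketch below is not the ABY implementation; the class and function names are hypothetical. It builds the three blocks from XOR and AND only, counts the AND gates used, and allows comparing the results with the intended behavior (selection, swapping, and function-table lookup).

```python
from itertools import product

class AndCounter:
    """Counts AND gates; XOR gates are treated as free."""
    def __init__(self):
        self.count = 0

    def AND(self, a, b):
        self.count += 1
        return a & b

def y_switch(in1, in2, c, ctr):
    # Eq. 23: out1 = (in1 XOR in2) AND c XOR in1
    return ctr.AND(in1 ^ in2, c) ^ in1

def x_switch(in1, in2, c, ctr):
    # Eq. 24: e = (in1 XOR in2) AND c; outputs (e XOR in1, e XOR in2)
    e = ctr.AND(in1 ^ in2, c)
    return e ^ in1, e ^ in2

def universal_gate(in1, in2, table, ctr):
    # Eq. 25: two Y gates select on in2 (in parallel), a third selects on in1,
    # which gives AND-depth 2.
    c0, c1, c2, c3 = table
    low = y_switch(c0, c1, in2, ctr)
    high = y_switch(c2, c3, in2, ctr)
    return y_switch(low, high, in1, ctr)

def check_blocks():
    """Exhaustively compare the blocks with their intended behavior."""
    for in1, in2, c in product((0, 1), repeat=3):
        ctr = AndCounter()
        assert y_switch(in1, in2, c, ctr) == (in2 if c else in1)
        assert x_switch(in1, in2, c, ctr) == ((in2, in1) if c else (in1, in2))
        assert ctr.count == 2            # one AND per switch
    for in1, in2, *table in product((0, 1), repeat=6):
        ctr = AndCounter()
        assert universal_gate(in1, in2, table, ctr) == table[2 * in1 + in2]
        assert ctr.count == 3            # size^AND(U) = 3
    return True
```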

We include our implementation of these efficient UC building blocks in the open-source ABY framework https://encrypto.de/code/ABY. For evaluating a UC securely, the output universal circuit file of our UC compiler is parsed, a circuit UC is generated and evaluated with the input x and the control bits \(c^f\) to compute f(x). Our toolchain is the first implementation of Valiant’s size-optimized UC that supports efficient private function evaluation [45].

8.2 Experimental Results

We validate the practicality of our implementation, which is the first practical implementation of private function evaluation (PFE), cf. Sect. 1.1. We ran our experiments on two Desktop PCs, each equipped with an Intel Core i9-7960X CPU with 2.8 GHz and 128 GB RAM. We give the runtimes in Fig. 15 and communication in Fig. 16 for our example circuits from the previous section, i.e., for random circuits of sizes \(n\in \{10, 100, \ldots , {1{,}000{,}000}\}\) as well as the AES and SHA-256 circuits from [65]. For completeness, we give the exact numbers in Table 7 in “Appendix D.” Our runtime measurements are provided from an average of 10 executions, in two different settings: in a LAN setting with 10 Gbit/s bandwidth and 1 ms RTT, as well as in a simulated WAN setting with 100 Mbit/s bandwidth and 100 ms RTT.

We evaluate UCs in ABY [19] with both the GMW protocol [32] and Yao’s garbled circuit protocol [69] with state-of-the-art optimizations. Yao’s garbled circuit protocol achieves much better runtimes than the GMW protocol since the latter has \({\mathcal {O}}(n)\) rounds (i.e., the number of rounds is the depth of the circuit, and Valiant’s UCs have depth \({\mathcal {O}}(n)\), cf. Sect. 6.1 and Table 7 in “Appendix D”), whereas Yao’s protocol runs in 3 rounds. The effect of this is especially apparent in the WAN setting where the round-trip time is much higher. In both settings, the runtime of the GMW protocol is dominated by the linear term due to the linear number of online rounds. The amount of communication is similar in both implementations; however, it could be reduced by half for Yao’s protocol if X and Y switches were implemented with the optimization from [43] using only one ciphertext. The current implementation utilizes two ciphertexts per X or Y switch.

Due to the clear advantage of Yao’s protocol over the GMW protocol, we highly recommend using Yao’s protocol when evaluating UCs securely for PFE. Investigating depth-optimized UCs [17], whose depth is \({\mathcal {O}}(d)\) in the depth d of the input circuit, could improve the performance of the GMW protocol; however, its number of rounds will still depend on d, whereas Yao’s protocol runs in only 3 rounds.

Fig. 15

Total runtime in seconds on LAN/WAN of PFE with the best available UC variant UCH(Valiant-2, Zhao et al.-4)

Fig. 16

Total communication in megabytes of PFE with the best available UC variant UCH(Valiant-2, Zhao et al.-4)

8.3 Comparison of PFE Approaches

Mohassel et al. in [53] design a generic framework for PFE and apply it to three different scenarios: to the m-party GMW protocol [32], to Yao’s garbled circuits [70], and to arithmetic circuits using homomorphic encryption [16]. Both the two-party versions of their framework with the GMW protocol and the one with Yao’s garbled circuit protocol have two alternatives: Using homomorphic encryption, they achieve linear complexity \({\mathcal {O}}(n)\) in the circuit size n, and when using a solution solely based on oblivious transfers (OTs), they obtain a construction with \({\mathcal {O}}(n\log n)\) symmetric-key operations. The OT-based construction in both cases is more desirable in practice, since OT extension reduces the number of expensive public-key operations significantly [2, 36].

Fig. 17

The number of symmetric-key operations of different PFE protocols: Valiant’s UC with Yao’s garbled circuits, Mohassel et al.’s OT-based method from [53] and its optimized version from [6]

Fig. 18

Communication of different PFE protocols in megabytes: UCH(Valiant-2, Zhao et al.-4) with Yao’s garbled circuits, Mohassel et al.’s OT-based method from [53] and its optimized version from [6]

As the asymptotic complexity of this construction and of using Valiant’s UC for PFE is the same, we compare these methods for PFE. We revisit the formulas provided in [53] for the PFE protocol based on Yao’s garbled circuits and elaborate on the number of symmetric-key operations when the different PFE protocols are used. Mohassel et al. show that the total number of switches in their framework is \(4{\tilde{g}}\log _2 (2{\tilde{g}}) + 1\) that are evaluated using OT extension, for which they calculate \(8{\tilde{g}}\log _2 (2{\tilde{g}}) + 8\) symmetric-key operations together with \(5{\tilde{g}}\) operations for evaluating the universal gates with Yao’s protocol. We count only the work of the party that performs most of the work, i.e., \(4{\tilde{g}}\) symmetric-key operations for creating a garbled circuit with \({\tilde{g}}\) gates and 3 symmetric-key operations (two calls to a hash function and one call to a pseudorandom function (PRF)) for each OT using today’s most efficient OT extension of [2]. Hence, according to our estimations, the protocol of [53] requires \(12{\tilde{g}}\log _2 (2{\tilde{g}}) + 4{\tilde{g}}+ 12\) symmetric-key operations.
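For concrete circuit sizes, the estimate above can be evaluated directly. The short Python sketch below only restates the final formula for the protocol of [53]; the function name is chosen here for illustration.

```python
import math

def ops_protocol_53(g_tilde):
    """Estimated symmetric-key operations of the OT-based protocol of [53]:
    12 * g * log2(2g) + 4 * g + 12, for a circuit with g fanin-2 gates."""
    return 12 * g_tilde * math.log2(2 * g_tilde) + 4 * g_tilde + 12

# Estimates for the circuit sizes used in the comparison.
estimates = {10 ** e: ops_protocol_53(10 ** e) for e in range(1, 7)}
```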

In the same way, we assume that in the case of PFE with UCs, for both the universal gates and switches, the garbler needs 4n symmetric-key operations. In this case, however, \(n=u+v+g\), where \({\tilde{g}}\le g\le 2{\tilde{g}}+ v\). It is, therefore, difficult to directly compare the complexities of specifically designed protocols with \({\tilde{g}}\) fanin-2 gates and of UCs, where the input circuit is additionally required to have fanout 2. In Fig. 17, we therefore depict the minimum and maximum required number of symmetric-key operations for circuits with size \({\tilde{g}}\in \{10, 100, \ldots , {1{,}000{,}000}\}\). Moreover, we depict the concrete values for SFE with UCs for real-world circuits (AES-128 and SHA-256 from [65]), and note that for the other approaches the points lie on the corresponding line.

The protocol of [53] has been improved to achieve better communication in [6]. The communication of the protocol of [53] is \((10 {\tilde{g}}\log _2 {\tilde{g}}+ 4 {\tilde{g}}+ 5)\cdot 128\), while that of [6] is \((6{\tilde{g}}\log _2{\tilde{g}}+ 0.5{\tilde{g}}+ 3)\cdot 128\). For SFE with UCs, we require one ciphertext per X or Y switch [43] and \(3\cdot 2\) ciphertexts per universal gate. Figure 18 compares the communication of SFE with UCs, with minimum and maximum values depending on the relation of \(g\) and \({\tilde{g}}\) as before, with the alternatives of [53] and [6]. We can see that SFE with UCs always achieves the best communication, requiring 1.5 to \(3\times \) less communication than the improvement of [6].
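To make the gap between the two OT-based protocols concrete, the sketch below evaluates the two communication formulas stated above and reports their ratio. The function names are hypothetical, and the values are left in the unit implied by the factor 128 in the formulas; interpreting that factor as a 128-bit ciphertext length is an assumption. The communication of SFE with UCs is not modeled here, since it depends on the concrete UC size.

```python
import math


def comm_mohassel(g_tilde: int) -> float:
    """Communication of the protocol of [53]: (10 g log2 g + 4g + 5) * 128."""
    return (10 * g_tilde * math.log2(g_tilde) + 4 * g_tilde + 5) * 128


def comm_improved(g_tilde: int) -> float:
    """Communication of the protocol of [6]: (6 g log2 g + 0.5g + 3) * 128."""
    return (6 * g_tilde * math.log2(g_tilde) + 0.5 * g_tilde + 3) * 128


# Illustrative sizes only: powers of ten from 10 to 1,000,000.
for exponent in range(1, 7):
    g = 10 ** exponent
    ratio = comm_mohassel(g) / comm_improved(g)
    print(f"g = {g:>9,d}: [53] / [6] = {ratio:.2f}")
```

As \({\tilde{g}}\) grows, the ratio approaches the quotient of the leading coefficients, \(10/6\), since the \({\tilde{g}}\log _2 {\tilde{g}}\) terms dominate both expressions.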

9 Conclusion

Universal circuits (UCs) are highly relevant for various applications such as verifiable computation, attribute-based encryption, and private function evaluation (PFE), which can, for example, be used for privacy-preserving evaluation of diagnostic programs and proprietary software, and in private database management systems. These applications require size-optimized universal circuits, first proposed by Valiant [66]. Since then, several optimizations have appeared that further reduce the size of UCs.

In this article, we revisit Valiant’s original constructions and the optimizations later proposed in our previous works by Kiss and Schneider [45] and Günther et al. [31] as well as by Zhao et al. [72]. We have shown the practicality of Valiant’s universal circuit constructions and their improvements by providing an implementation of the most efficient UC to date, with size \({\sim }\,4.5 n\log _2n\) in the input circuit size n. Moreover, we substantially reduce the memory consumption of our UC generation algorithm by designing and implementing a method that uses \({\mathcal {O}}(n)\) memory instead of the \({\mathcal {O}}(n\log n)\) memory required by previous methods.

Universal circuits for an input circuit size of one million can be generated and programmed in around 18 minutes on a standard PC and utilized in various applications. We demonstrate the practicality of PFE with the secure evaluation of UCs and show that such a large universal circuit can be evaluated within 1.3 and 5.9 minutes using Yao’s garbled circuit protocol in LAN and WAN settings, respectively.