Efficient and Scalable Universal Circuits

A universal circuit (UC) can be programmed to simulate any circuit up to a given size n by specifying its program inputs. It provides elegant solutions in various application scenarios, e.g., for private function evaluation (PFE) and for improving the flexibility of attribute-based encryption schemes. The asymptotic lower bound for the size of a UC is $\Omega(n\log n)$, and Valiant (STOC’76) provided two theoretical constructions, the so-called 2-way and 4-way UCs (i.e., recursive constructions with 2 and 4 substructures), with asymptotic sizes ${\sim}5n\log_2 n$ and ${\sim}4.75n\log_2 n$, respectively. In this article, we present and extend our results published in (Kiss and Schneider EUROCRYPT’16) and (Günther et al. ASIACRYPT’17). We validate the practicality of Valiant’s UCs by realizing the 2-way and 4-way UCs in our modular open-source implementation. We also provide an example implementation for PFE using these size-optimized UCs. We propose a 2/4-hybrid approach that combines the 2-way and the 4-way UCs in order to minimize the size of the resulting UC.
We observe that the bottleneck in universal circuit generation and programming is the memory consumption of the program, since the whole structure of size $\mathcal{O}(n\log n)$ is handled by the algorithms in memory. In this work, we overcome this by designing novel scalable algorithms for the UC generation and programming. Both algorithms use only $\mathcal{O}(n)$ memory at any point in time. We demonstrate the practicality of our scalable design with a scalable proof-of-concept implementation for generating Valiant’s 4-way UC. We note that this can be extended to work with optimized building blocks analogously. Moreover, we substantially improve the size of our UCs by including and implementing the recent optimization of Zhao et al. (ASIACRYPT’19) that reduces the asymptotic size of the 4-way UC to ${\sim}4.5n\log_2 n$. Furthermore, we include their optimization in the implementation of our 2/4-hybrid UC, which yields the smallest UC construction known so far.


Introduction
Any computable Boolean function f(x) can be represented as a Boolean circuit $C^{g}_{u,v}(x)$ with u input wires $x = (\mathit{in}_1, \ldots, \mathit{in}_u)$, v output wires $\mathit{out}_1, \ldots, \mathit{out}_v$, and g gates for some u, v, g. The size of such a Boolean circuit is n = u + v + g. Universal circuits (UCs) are programmable circuits that can simulate any Boolean function f(x) up to a given size n. To program a UC to compute f, programming or control bits are specified as further inputs $c_f = \{c_1, \ldots, c_m\}$. The UC then receives these control bits as inputs along with the input x and computes the result as $UC(x, c_f) = f(x)$. This means that the same UC can evaluate different Boolean circuits by specifying the respective control bits. In analogy to a universal Turing machine, a universal circuit allows turning any function into data in the form of a program description.
Several efficient constructions considering both the size and the depth of UCs were proposed. Valiant proposed in [66] an asymptotically size-optimal UC construction with size $\Theta(n \log n)$ and depth O(n) [68]. He presents two constructions, called 2-way and 4-way UCs, based on so-called edge-universal graphs (EUGs) that utilize either 2 or 4 subcircuits, respectively. The asymptotic size of the 4-way UC is ${\sim}4.75n\log_2 n$, which is smaller than that of the 2-way UC with ${\sim}5n\log_2 n$ [66]. The 4-way UC has been further improved in [72], where its size is reduced to ${\sim}4.5n\log_2 n$. An asymptotically depth-optimal construction with depth $\Theta(d)$ that simulates circuits with depth d was proposed in [17], but it has a significantly larger size of $O(n^3 d/\log n)$. In our paper, due to the applications in cryptography that we revisit in Sect. 1.1, we concentrate on the existing size-optimized UCs, especially that proposed by Valiant [66] with asymptotic size $\Theta(n \log n)$ with the optimization presented by Zhao et al. in [72].
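To get a feel for the leading terms quoted above, the following minimal Python sketch evaluates coefficient · n · log2 n for the three constructions. It is only an illustration: lower-order terms are ignored, so the numbers are not the concrete UC sizes, and the function and dictionary names are our own illustrative choices.

```python
import math

# Leading-term coefficients quoted in the text; lower-order terms are ignored,
# so these values only approximate the real UC sizes.
LEADING_COEFFICIENT = {
    "2-way [66]": 5.0,
    "4-way [66]": 4.75,
    "4-way with [72]": 4.5,
}


def leading_term_size(n: int, coefficient: float) -> float:
    """Return coefficient * n * log2(n), the dominant term of the UC size."""
    return coefficient * n * math.log2(n)


def compare(n: int) -> dict:
    """Map each construction name to its leading-term size for circuit size n."""
    return {name: leading_term_size(n, c) for name, c in LEADING_COEFFICIENT.items()}


if __name__ == "__main__":
    for n in (1_000, 1_000_000):
        for name, size in compare(n).items():
            print(f"n={n:>9,}  {name:<16} ~{size:,.0f}")
```

The ratio between two constructions is constant in this approximation (e.g., 4.5/5 = 0.9 between the optimized 4-way and the 2-way UC), which is why the concrete comparison has to take lower-order terms into account.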

Applications of Universal Circuits
Size-optimized universal circuits have many applications. We review them here and refer to the original publications for more detailed descriptions.

Private Function Evaluation (PFE)
The most prominent application of universal circuits is the secure evaluation of private functions based on secure function evaluation (SFE) or secure computation. SFE enables two parties $P_1$ and $P_2$ to evaluate a publicly known function f(x, y) on their respective private inputs x and y, ensuring that none of the participants learns anything about the other participant's input apart from the output of the computation. Many secure computation protocols, such as Yao's garbled circuit protocol [47,69,70] and the GMW protocol [32], use Boolean circuits for representing the desired functionality. In some applications, the function itself should be kept private. This setting is called private function evaluation (PFE), where we assume that only one of the parties, $P_1$, knows the function f(x), whereas the other party $P_2$ provides the input x to the private function. $P_2$ should learn no information about f except for an upper bound on the size of the circuit describing the function, and $P_1$ should learn nothing about x beyond what can be inferred from the result f(x). PFE can be reduced to SFE [1,44,58,63] by securely evaluating a UC that is programmed by $P_1$ to evaluate the function f on $P_2$'s input x. For this, $P_1$ provides the control bits $c_f$ for the UC and $P_2$ provides his private input x into an SFE protocol that computes $UC(x, c_f)$. Here, the UC is a public function, and the control bits $c_f$ (and therefore the function f) as well as the input x are kept private due to the properties of SFE. The first implementation of PFE was provided in [44,61], which extends the Fairplay secure computation framework [51] with universal circuits. The underlying UC construction achieves a non-optimal asymptotic size of $O(n \log^2 n)$ and depth $O(n \log n)$. We have shown in [45] that it results in larger UCs than Valiant's constructions for all reasonable circuit sizes in practice.
The complexity of PFE in this case is determined mainly by the size and depth of the UC, while the security follows from that of the SFE protocol that is used to evaluate the UC. If the SFE protocol is secure against semi-honest, covert, or malicious adversaries, then the PFE protocol is secure in the same adversarial setting. UC-based PFE can be easily integrated into any SFE framework and can directly benefit from recent optimizations. For instance, outsourcing UC-based PFE to two or multiple servers using XOR secret sharing is directly possible with outsourced SFE [42]. The non-interactive secure computation protocol of [3] can be generalized to obtain a noninteractive PFE protocol [46]. Moreover, with UC-based PFE, evaluating public and private parts of a functionality can easily be performed together without modifying the underlying secure computation framework.
In [40], Katz and Malka presented an alternative approach for PFE that does not rely on UCs. They use additively homomorphic public-key encryption as well as a symmetric-key encryption scheme and achieve constant-round PFE with linear O(n) communication complexity. However, the number of public-key operations is linear in the circuit size, and due to the gap between the efficiency of public-key and symmetric-key operations, this results in a less efficient protocol. Their protocol is secure against semi-honest adversaries, uses Yao's garbled circuits [70], and has recently been improved in [5], where the authors modify the algorithm to perform one full execution from which information can be reused in subsequent more efficient executions of the protocol. Mohassel and Sadeghian consider PFE with semi-honest adversaries in [53] and propose a generic PFE framework that can be instantiated with different secure computation protocols. Their first protocol uses homomorphic encryption, with which they achieve linear complexity O(n) in the circuit size n, and their second protocol relies solely on oblivious transfers (OT), which results in a method with O(n log n) symmetric-key operations. The OT-based construction from [53] or PFE using UCs is more desirable than the linear homomorphic encryption-based methods in practice, since using OT extension, the number of expensive public-key operations can be reduced significantly, such that it is independent of the number of OTs [2,36]. Biçer et al. [6] improve the communication of the OT-based PFE protocol of [53] by around 40%. The asymptotic complexity of the OT-based construction of [53] and Valiant's UCs for PFE is the same, and therefore, we compare these solutions for PFE in more detail in Sect. 8. Mohassel et al. extend the framework from [53] to malicious adversaries in [54] with linear complexity O(n), using additively homomorphic encryption.
Active security of UC-based PFE is achieved by using a secure computation protocol with active security. Despite their claimed better efficiency, these alternative protocols have, to the best of our knowledge, not yet been implemented and are not as generally applicable as PFE with UCs, e.g., they cannot be easily combined with secure evaluation of public functions.
Semi-private function evaluation (semi-PFE) has been proposed in [60] and allows for PFE where the function f is in a set of functions F known by both parties. This relaxes the necessary topology hiding requirement of generic PFE. Yao's garbled circuit can be used for evaluating circuits of the same topology as shown in [59]. Recently, an automated approach for semi-PFE has been proposed in [39], where the circuits representing f ∈ F have varying topologies, for which a container topology is found that can be programmed to compute any of the available topologies. This has therefore been defined as a set-universal circuit, i.e., a circuit that can be programmed to compute any circuit from a pre-defined set of circuits. This approach has been further improved in [41], where a modified garbled circuit protocol allows for efficient semi-PFE with linear communication in the size of the largest circuit in F. However, semi-PFE does not suffice for generic PFE where we have an exponential number of possible circuit topologies.

Applications of PFE
PFE can be applied in scenarios where one of the parties wants to keep the evaluated function private. One of the first applications for PFE was privacy-preserving checking for credit worthiness [21], where not only the loanee's data, but also the loaner's function that computes if the loanee is eligible for a credit needs to be kept private. The original scheme, using garbled circuits, can represent simple policies, but by evaluating a UC their scheme can be extended to more complicated credit checking policies. [15] shows an application for secure computation, where evaluating UCs or other PFE protocols would ensure privacy: When autonomous mobile agents migrate between several distrusting hosts, the privacy of the inputs of the hosts is achieved using SFE, while privacy of the mobile agent's code can be guaranteed with PFE. [57] shows a method to filter remote streaming data obliviously, using secret keywords and their combinations. Their scheme can additionally preserve data privacy by using PFE to search the matching data with a private search function. PFE allows for running proprietary software on private data, such as privacy-preserving evaluation of diagnostic programs that was considered in [13], where the owner of the program does not want to reveal the diagnostic method and the user does not want to reveal his data. Example applications for such programs include medical diagnostics [9] and remote software fault diagnosis, where the function and the user's input are desired to be handled privately. In the protocol presented in [13], the diagnostic programs are represented as binary decision trees or branching programs which can easily be converted into a Boolean circuit representation and evaluated using PFE based on universal circuits. Moreover, PFE can be applied to create blinded policy evaluation protocols [20,24]. [20] utilizes UCs for so-called oblivious circuit policies and [18] for hiding the circuit topology in order to create one-time programs. 
In [25,59], universal circuits are used for hiding queries in private database management systems (DBMSs). The Blind Seer DBMS [25] was improved in [59] by making use of a simpler UC for evaluating queries, which does not hide the circuit topology. The authors mention that in case the topology of the SQL formula and the circuit have to be kept private, a generic UC should be utilized. Further applications of PFE given in [53] are evaluation of branching programs on encrypted data [37] and privacy-preserving intrusion detection [56].

UC Applications Beyond PFE
Apart from being used for PFE, UCs can be applied in various other scenarios. Efficient verifiable computation on encrypted data was studied in [22]. A verifiable computation scheme was proposed for arbitrary computations, and a UC is required to hide the function. [29] make use of UCs for reducing the verifier's preprocessing step. In [30], a DDH-based multi-hop homomorphic encryption scheme is proposed that uses rerandomizable garbled circuits, for which UCs are used to achieve function privacy. When the common reference string is dependent on a function that the verifier is interested in outsourcing, then the function description can be provided as input to a UC of appropriate size. As described in [4], the attribute-based encryption (ABE) schemes [27,34] for any polynomial-size circuits can be turned into ciphertext-policy ABE by using UCs. The ABE scheme of [28] also uses UCs. Universal circuits can be applied for program obfuscation. Candidates for indistinguishability obfuscation are constructed using a UC as a building block in [14,26]. The algorithm of [26] has been implemented in [12], which can be improved using Valiant's UC implementation [45]. Direct program obfuscation was proposed in [71], where the circuit is a secret key to a UC. [46] mentions that UCs can be applied for secure two-party computation in the batch execution setting, where the cost of evaluating Yao's garbled circuits is amortized if the same circuit (here, a UC) is evaluated [35,49]. This protocol has been made round-optimal in [52].

Implied Theoretical Results
We mention two theoretical results relying on UCs. Both the depth-optimized UC from [17] and Valiant's size-optimized UCs were adapted in [8] to construct universal quantum circuits. The design of universal parallel computers was inspired by Valiant's UCs as well [33,50].

Our Contributions and Outline
In Sect. 2, we recapitulate the necessary preliminaries for our work. We revisit the asymptotically size-optimal UCs of [66] in Sect. 3. This complex construction makes use of an internal graph representation and programs a so-called edge-universal graph (Sect. 3.1). Thereafter, we describe how an edge-universal graph can be translated into a universal circuit (Sect. 3.2). Finally, we revisit Valiant's 2-way (Sect. 3.3) and 4-way UCs (Sect. 3.4) and the improved building block proposed by Zhao et al. [72] for the latter.
Our modular programming algorithm (Sect. 4). We detail our modular algorithm for programming a universal circuit that provides the description of the input function f as program bits $c_f$ to the UC, for both Valiant's 2-way and 4-way UCs. Our method consists of two steps, the block edge-embedding (Sect. 4.1) and the recursion point edge-embedding (Sect. 4.2).
New universal circuit constructions and extensions (Sect. 5). We describe Lipmaa et al.'s generalization [46] of Valiant's universal circuit to any k-way UC (Sect. 5.1) and detail how our modular programming algorithm from Sect. 4 can be directly generalized for this extension. We continue with presenting a new 3-way UC (Sect. 5.2) that is predicted to be more efficient than the existing UCs. However, after providing modular building blocks for this UC, we show that it is asymptotically larger than Valiant's UCs, due to an optimization that cannot be applied for one of its building blocks. Then, we propose a hybrid UC construction (Sect. 5.3) that can efficiently combine k-way UCs for multiple values of k. With this, we combine Valiant's 2-way and 4-way UCs to achieve the smallest universal circuit known so far. Lastly, we provide our scalable algorithms (Sect. 5.4) that allow for generating and programming UCs with only linear O(n) memory instead of handling the whole structure of size O(n log n) in memory at once.
Optimized size and depth of UCs (Sect. 6). We compare the asymptotic (Sect. 6.1) and concrete (Sect. 6.2) sizes of Valiant's (2-way and 4-way) UCs and that of different k-way UCs. We show that of all k-way UCs of Lipmaa et al. [46], Valiant's 4-way UC provides the smallest size for large circuits, whereas Valiant's 2-way UC provides the smallest depth. We include size optimizations, achieving a linear concrete improvement for all UCs. Moreover, we show that our 2/4 hybrid method for generating UCs improves over the 4-way UCs, i.e., both over Valiant's 4-way UC and over the optimized 4-way UC of [72].
Implementation of Valiant's UCs and experiments (Sect. 7). We detail the steps of our algorithm for a practical realization of Valiant's UC construction and implement the 2-way and recently optimized 4-way UCs as well as our 2/4 hybrid UC construction. We note that our implementation is the first implementation that includes the optimization of Zhao et al. [72], which achieves the best size ${\sim}4.5n\log_2 n$ to date. We describe the architecture of our UC compiler (Sect. 7.1). We experimentally evaluate the performance of our UC generation and programming algorithms with a set of example circuits (Sect. 7.2). We provide the evaluation of our scalable 4-way UC as well and compare it with our memory-based implementation of Valiant's 4-way UC.
Toolchain for private function evaluation using universal circuits (Sect. 8). We provide the implementation of an example application for universal circuits, namely of private function evaluation (PFE) by extending the ABY secure function evaluation framework [19] to evaluate our universal circuits (Sect. 8.1). We provide the first implementation for PFE with O(n log n) complexity and show experimental results for performing PFE (Sect. 8.2). We theoretically compare PFE with UCs with other state-of-the-art approaches for PFE (Sect. 8.3).

Additions to Conference Versions
This journal article is a significantly extended and improved version of the conference publications [45] and [31]. Our added contributions are as follows.
1. Optimizations. We included the optimized building block of [72] in our 4-way and hybrid implementations as well as in the size and depth comparisons. This allows us to compare all state-of-the-art methods for UCs. This is the first implementation of their construction, which has the lowest asymptotic and concrete sizes known so far.
2. Scalability. We extend our design and implementation with a scalable 4-way UC construction based on Valiant's 4-way UC, which reduces the memory complexity from O(n log n) to O(n) when generating and programming the universal circuit. This construction involves a novel layer-by-layer approach for generating and topologically ordering the universal circuit and programs the structure according to the recursion steps, i.e., subcircuit by subcircuit.
3. Universal circuit depths. We examine the depth of the universal circuits in addition to their sizes: although the UCs are optimized for size, some applications also require minimizing the depth. For instance, the number of communication rounds in PFE via secure function evaluation with the GMW protocol [32] (which, in contrast to Yao's garbled circuits, allows precomputing all symmetric cryptographic operations [64]) depends on the depth of the universal circuit.
4. Comparison and implementation. In our previous works, we have compared the 2-way and 4-way UCs with each other and with the only other existing UC of [44]. In this work, we implement the hybrid method that uses both 2-way and 4-way UCs and achieves the best concrete size for all simulated circuit sizes. We also implement our new scalable 4-way UC construction, which utilizes very different algorithms than those applied before for UC generation. We compare these methods with respect to runtime, communication, and memory consumption.

Preliminaries
As preliminaries for our paper, we introduce the graph and circuit theoretic background in Sect. 2.1 and Sect. 2.2, respectively. We provide a summary of all our notations and abbreviations in "Appendix A."

Graph Theory
In this section, we describe the graph theoretic preliminaries necessary for our work. We denote by $\Gamma_\rho(n)$ the set of all directed acyclic graphs with n nodes and fanin and fanout ρ.
In short, i > j implies that there is no edge or directed path from i to j.
A topological order of $G \in \Gamma_\rho(n)$ can be found with computational complexity O(ρn). Further on, we require a labeling of the nodes in a topological order.
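The text does not fix a particular algorithm for finding the topological order. As one possible illustration, the following sketch uses Kahn's algorithm, which visits every node and edge once and therefore runs in O(n + |E|) = O(ρn) time for graphs with fanout at most ρ. The function name and the edge-list representation are assumptions made for this example.

```python
from collections import deque


def topological_order(n, edges):
    """Return a dict mapping each node 1..n to its label (1..n) in a topological order.

    `edges` is a list of directed edges (a, b); Kahn's algorithm is used here
    as one standard choice, not necessarily the one used by the authors.
    """
    successors = {v: [] for v in range(1, n + 1)}
    indegree = {v: 0 for v in range(1, n + 1)}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1

    # Nodes without unprocessed predecessors can be labeled next.
    ready = deque(v for v in range(1, n + 1) if indegree[v] == 0)
    label = {}
    while ready:
        v = ready.popleft()
        label[v] = len(label) + 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)

    if len(label) != n:
        raise ValueError("graph contains a cycle")
    return label
```

After relabeling the nodes with the returned labels, every edge (i, j) satisfies i < j, which is the labeling required in the remainder of the section.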

Definition 3.
Edge-embedding is a mapping from graph G = (V, E) into G' = (V', E') that maps V into V' one-to-one, with possible additional nodes in V', i.e., $V \subseteq V'$, and E into directed paths in E', such that all paths are pairwise edge-disjoint, i.e., an edge can be used only in one path.
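The conditions of Definition 3 can be checked mechanically. The following sketch is a hypothetical helper (the name `is_edge_embedding` and the data layout are our own assumptions): it verifies that the node mapping is one-to-one, that every edge of G is mapped to a directed path between the images of its endpoints, and that no edge of the host graph is used by more than one path.

```python
def is_edge_embedding(edges_g, edges_host, node_map, paths):
    """Check whether (node_map, paths) is an edge-embedding of G into the host graph.

    edges_g:    set of edges (a, b) of G
    edges_host: set of edges of the host graph G'
    node_map:   dict mapping nodes of G to nodes of G'
    paths:      dict mapping each edge (a, b) of G to a list of host nodes
                describing the directed path assigned to that edge
    """
    # The node mapping must be one-to-one.
    if len(set(node_map.values())) != len(node_map):
        return False

    used = set()  # host edges already consumed by some path
    for a, b in edges_g:
        path = paths.get((a, b))
        if path is None or len(path) < 2:
            return False
        # The path has to connect the images of a and b.
        if path[0] != node_map[a] or path[-1] != node_map[b]:
            return False
        for step in zip(path, path[1:]):
            # Each step must be a host edge that no other path uses.
            if step not in edges_host or step in used:
                return False
            used.add(step)
    return True
```

For example, mapping the edges (1, 2) and (1, 3) to the paths p1 → x → p2 and p1 → x → p3 is rejected, because both paths use the host edge (p1, x).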
Theorem 1. (Kőnig-Hall theorem) Given a directed acyclic graph (DAG) $G \in \Gamma_2(n)$, the set of edges E can be separated into two disjoint sets $E_1$ and $E_2$, such that graphs $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$ are instances of $\Gamma_1(n)$, having fanin and fanout 1 for each node [38,48,66].
The corresponding edge-coloring repeats the following step while uncolored edges remain: Choose an uncolored edge $e = (m_i, m_j)$ randomly and color the path or cycle that contains it in an alternating manner, i.e., the neighboring edge(s) of an edge of the first color will be colored with the second color and vice versa.
This edge-coloring can be performed in O(n) steps and it defines the edges in $E_1$ and $E_2$, such that $E_1$ contains the edges colored with color one and $E_2$ the ones with color two, yielding $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$.
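The alternating coloring can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: edges are given as distinct pairs (i, j) without parallel edges, the outgoing side of node i and the incoming side of node j are treated as separate bipartite nodes `("out", i)` and `("in", j)`, and the start edge is picked in list order instead of randomly.

```python
def two_color_edges(edges):
    """Split the edges of a fanin/fanout-2 DAG into E1 and E2 with fanin/fanout <= 1."""
    # Bipartite view: ("out", i) collects edges leaving i, ("in", j) edges entering j.
    incident = {}
    for e in edges:
        i, j = e
        incident.setdefault(("out", i), []).append(e)
        incident.setdefault(("in", j), []).append(e)
    if any(len(es) > 2 for es in incident.values()):
        raise ValueError("fanin or fanout exceeds 2")

    color = {}
    for start in edges:
        if start in color:
            continue
        color[start] = 0
        # Walk away from both ends of the start edge, alternating the colors.
        for side in (("out", start[0]), ("in", start[1])):
            edge, node = start, side
            while True:
                other = [e for e in incident[node] if e != edge]
                if not other or other[0] in color:
                    break  # end of the path, or the cycle has been closed
                color[other[0]] = 1 - color[edge]
                edge = other[0]
                # Continue at the opposite endpoint of the newly colored edge.
                node = ("in", edge[1]) if node[0] == "out" else ("out", edge[0])

    e1 = [e for e in edges if color[e] == 0]
    e2 = [e for e in edges if color[e] == 1]
    return e1, e2
```

Since every bipartite node has at most two incident edges, each connected component is a path or an even cycle, so the alternation never assigns the same color to two edges sharing a node, and every edge is visited a constant number of times.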
The Kőnig-Hall theorem was used in [45,46] to provide a 2-coloring algorithm for the edges of a graph with fanin and fanout 2. In its originally proposed form, however, Kőnig's theorem [38,48] also applies to k-coloring the edges of any graph with at most k incoming and outgoing edges for each of its nodes. This transformation can be easily generalized to graphs in $\Gamma_k(n)$, in which case the resulting bipartite graph will have fanin and fanout k. We review this theorem and the corresponding algorithm here.

Theorem 2. (Kőnig's theorem) If G is bipartite and its nodes have at most k incoming and outgoing edges, then the number of colors sufficient to color all edges of G is k.
Proof of Theorem 2. ([38,48]) Take colors {1, ..., k}, and greedily color edges. Assume that at some point the coloring stops because we cannot color more edges, and let $(w_i, z_j)$ be an uncolored edge at this point. Looking at the colors of the edges adjacent to $w_i$ and $z_j$, we can define the set of available colors for both nodes. There is at least one available color for each of $w_i$ and $z_j$ due to the fanin and fanout restriction, but there is no color that is available for both nodes, since otherwise we could color $(w_i, z_j)$.
Hence, there is a color a that is used on an edge adjacent to $w_i$, but not on an edge adjacent to $z_j$. In the same way, we can find another color b that is used on an edge adjacent to $z_j$, but not on an edge adjacent to $w_i$. Take the longest unique path P from $w_i$ that uses colors a and b alternatingly.
Assume for contradiction that this path also contains $z_j$. It then terminates in $z_j$, since $z_j$ is not adjacent to an edge colored with a. Then, $P \cup (w_i, z_j)$ is an odd cycle, which is impossible since G is bipartite. Therefore, P does not contain $z_j$, and we can exchange colors a and b on path P and color $(w_i, z_j)$ with color a.
This process is continued until there are no uncolored edges in G.
Proof of Theorem 3. Shannon's expansion theorem [61,62] describes how gates with larger fanin can be reduced to gates with two inputs by adding additional gates, which results in a circuit $C^{\hat{g}}_{u,v}$ with $\hat{g}$ fanin-2 gates. It was proven in [66] that the general case, where the fanout of the circuit can be any integer ρ ≥ 2, can be transformed to the special case when ρ ≤ 2 by introducing copy gates, each of which eliminates one from the extra fanout of the original gate. We place a binary tree in place of each gate with fanout larger than 2, following Valiant's proposition: "Any gate with fanout x + 2 can be replaced by a binary fanout tree with x + 1 gates" [66, Corollary 3.1]. Thus, the class of Boolean functions with u inputs and v outputs that can be realized by acyclic circuits with $\hat{g}$ gates and arbitrary fanout can also be realized with an acyclic fanout-2 circuit with g gates, where $\hat{g} \le g \le 2\hat{g} + v$.
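The fanout reduction used in the proof of Theorem 3 can be illustrated with a short sketch. It is a simplification under stated assumptions: the circuit is given only as a map from each wire source to the nodes reading it, the copy gates are chained (a degenerate binary tree, which is simple but not depth-optimal), and the names `copy1`, `copy2`, ... are hypothetical and assumed not to clash with existing node names.

```python
def limit_fanout_to_two(consumers):
    """Return a new consumer map in which every source has at most two readers.

    `consumers` maps each wire source (input or gate) to the list of nodes
    that read its value. A source with fanout x + 2 ends up as the root of a
    chain of x copy gates, i.e., a fanout tree with x + 1 gates in total.
    """
    result = {}
    counter = 0
    for source, readers in consumers.items():
        current, remaining = source, list(readers)
        while len(remaining) > 2:
            counter += 1
            copy_gate = f"copy{counter}"  # hypothetical name for the new copy gate
            # Keep one reader and hand all others to the copy gate.
            result[current] = [remaining[0], copy_gate]
            current, remaining = copy_gate, remaining[1:]
        result[current] = remaining
    return result
```

For a gate with five readers (x = 3), the sketch inserts three copy gates, matching the count in Valiant's statement: each copy gate removes one unit of the extra fanout.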

Definition 5.
We can regard $C^{g}_{u,v}$ with u inputs, v outputs, and g gates as a $\Gamma_2(n)$ graph G (which we commonly refer to as the graph of circuit $C^{g}_{u,v}$) with n = u + v + g by creating a node for each input, gate, and output, and an edge for each wire in $C^{g}_{u,v}$.
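As an illustration of Definition 5, the sketch below builds the graph of a small circuit. The input format (a list of input names, a dict of gates in topological order, and a dict of outputs) is an assumption made for this example and not the format used by the authors' tools.

```python
def circuit_to_graph(inputs, gates, outputs):
    """Build the graph of a circuit with n = u + v + g nodes.

    inputs:  list of input wire names (u entries)
    gates:   dict gate name -> tuple of source names, in topological order (g entries)
    outputs: dict output name -> source name (v entries)
    Returns (index, edges): `index` numbers the nodes 1..n with inputs first,
    then gates, then outputs; `edges` holds one edge per wire.
    """
    names = list(inputs) + list(gates) + list(outputs)
    index = {name: i + 1 for i, name in enumerate(names)}

    edges = []
    for gate, sources in gates.items():
        for src in sources:
            edges.append((index[src], index[gate]))
    for out, src in outputs.items():
        edges.append((index[src], index[out]))
    return index, edges


# Example: g1 = op(x1, x2), g2 = op(x1, g1), output o1 = g2, so n = 2 + 1 + 2 = 5.
index, edges = circuit_to_graph(
    ["x1", "x2"],
    {"g1": ("x1", "x2"), "g2": ("x1", "g1")},
    {"o1": "g2"},
)
```

Because inputs are numbered before gates and gates before outputs, every edge (i, j) in this example satisfies i < j, i.e., the numbering is already a topological order.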

Valiant's Universal Circuit Constructions
In any circuit $C^{\hat{g}}_{u,v}$, the inputs of each of the $\hat{g}$ gates are either connected to one of the u inputs, to the output of a previous gate, or are assigned a fixed constant. Due to the nature of Valiant's edge-universal graph (EUG) construction, the input circuit must have fanin and fanout 2, which can be achieved with the transformations described in Sect. 2.2 and implemented in [44,45]. From here on, and without loss of generality, we assume that our input circuit $C^{g}_{u,v}$ has u inputs, g gates and v outputs and fanin and fanout 2. The size of a function f represented by a circuit $C^{g}_{u,v}$ with fanin and fanout 2 is n = u + v + g, which can be represented as a graph $G \in \Gamma_2(n)$. In this section, we describe Valiant's UC constructions [66,68] that can be programmed to evaluate any function of size n. We explain the general idea behind Valiant's UC construction [66] in Sects. 3.1 and 3.2, and the 2-way and 4-way UCs along with improvements of [31,45,46,72] in Sects. 3.3 and 3.4, respectively.

Valiant's Edge-Universal Graph Construction
Valiant's UC construction relies on the notion of so-called edge-universal graphs that are then translated to universal circuits [66].
An EUG $U_n(\Gamma_\rho)$ has distinguished nodes called poles $P = \{p_1, \ldots, p_n\} \subseteq V_U$ where each node $a \in V = \{1, \ldots, n\}$ is mapped to exactly one pole with an injective mapping $\varphi_V : V \to V_U$. This mapping is defined by a concrete topological order $\eta_G$ of the original graph G with $\varphi_V(a) = p_{\eta_G(a)}$, i.e., every node in G has a corresponding pole in $U_n(\Gamma_\rho)$. Apart from the poles, $U_n(\Gamma_\rho)$ might have additional nodes that enable the edge-embedding (cf. Sect. 2.1). For each edge $(a_i, a_j) \in E$, we then define a path of variable length z between the corresponding poles. All these paths are edge-disjoint, i.e., they do not use any edge in $U_n(\Gamma_\rho)$ in more than one path (cf. Sect. 2.1).
Let $U_n(\Gamma_1)$ be an EUG for graphs in $\Gamma_1(n)$ with n poles $P = \{p_1, \ldots, p_n\}$ (we will show concrete constructions for such EUGs in Sect. 3.3 and in Sect. 3.4). The nodes of any topologically ordered $\Gamma_1(n)$ graph can be mapped to these poles. The poles have fanin and fanout 1, while all other nodes have fanin and fanout 2.

Translating Edge-Universal Graphs into Universal Circuits
In this section, we define universal circuits (UCs) and describe how an edge-universal graph is translated into a universal circuit.

Definition 7.
A universal circuit UC is a Boolean circuit that can be programmed to compute any circuit $C^{g}_{u,v}$ up to a given size n by defining a set of programming bits $c_f$ such that $UC(x, c_f) = C^{g}_{u,v}(x)$.
In Valiant's UC constructions, every node $w \in V_U$ fulfills a task when $U_n(\Gamma_2)$ is translated to a UC. Programming the UC means specifying its control bits along the paths defined by the edge-embedding and by the gates of circuit $C^{g}_{u,v}$. Depending on the number of incoming and outgoing edges and its type, a node w is translated as described below and shown in the example in Fig. 1f.

G1 If w is a pole and corresponds to an input (one of the first u poles) or an output (one of the last v poles) in G, then w is an input or output in $C^{g}_{u,v}$ as well.
G2 If w is not a pole and has indegree 1 and outdegree 2, this node has been placed to copy its input to its two outputs. Therefore, when translated to a UC, w is replaced by multiple outgoing wires in the parent node (as described in [45]), since the UC does not need to fulfill the fanout 2 restriction. In $U_n(\Gamma_2)$, w is added due to the fanout 2 restriction in the EUG necessary for the edge-embedding.
G3 If w is not a pole and has indegree and outdegree 1, w is removed and replaced by a wire between its parent and child nodes.
G4 If w is a pole and corresponds to a gate in G, w is programmed as a universal gate (UG). It implements function U, whose output on its two inputs is determined by the control vector $c = (c_1, c_2, c_3, c_4)$.
G5 If w is not a pole and has indegree and outdegree 2, w is programmed as an X-switching block, which computes the function X shown in Fig. 2a. The inputs of an X-switching block are forwarded to its outputs, switched or not switched, depending on control bit c.
G6 If w is not a pole and has indegree 2 and outdegree 1, w is programmed as a Y-switching block that computes the function Y shown in Fig. 2b. The inputs of a Y-switching block are forwarded to its output depending on the control bit c, i.e., it provides the functionality of a 2-input multiplexer.
We note that the u inputs and the v outputs can be ordered arbitrarily within themselves as long as the inputs are kept before the g topologically ordered gates and the outputs after them. Even though the output nodes cause an overhead in Valiant's UC, they are required to fully hide the topology of the circuit in the corresponding universal circuit. Note that optionally it is possible to modify the input circuit such that the outputs of the last v gates in order are the outputs of the circuit by inserting at most v copy gates [40].
The nodes programmed as UG (G4), X-switching block (G5), or Y-switching block (G6) are so-called programmable blocks. This means that a control bit c or vector $c = (c_1, c_2, c_3, c_4)$ is necessary aside from the two inputs to define their behavior. The universal gates are programmed according to the simulated gates in $C^{g}_{u,v}$ and the universal switches according to the paths defined by the edge-embedding of the graph of the circuit G into the edge-universal graph $U_n(\Gamma_2)$. Depending on whether the path takes the same direction during the embedding (e.g., arrives from the left and continues on the left) or changes its direction at a given node (e.g., arrives from the left and continues on the right), the control bit of the universal switch is programmed accordingly. In Sect. 7.1, we describe efficient implementations of programmable blocks. All control bits and vectors together are the programming $c_f$ of the UC.
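The behavior of the three programmable blocks can be sketched in a few lines. The conventions below are assumptions made for illustration only: the control vector of the universal gate is read as the truth table for the inputs (0,0), (0,1), (1,0), (1,1) in this order, and control bit c = 1 means "switch" for the X-block and "select the second input" for the Y-block. The efficient gate-level implementations of Sect. 7.1 are not reproduced here.

```python
def universal_gate(in1, in2, c):
    """2-input universal gate: c = (c1, c2, c3, c4) is the truth table of the gate.

    Assumed ordering: c1 for inputs (0,0), c2 for (0,1), c3 for (1,0), c4 for (1,1).
    """
    return c[2 * in1 + in2]


def x_switch(in1, in2, c):
    """X-switching block: forward both inputs, swapped if c = 1 (assumed convention)."""
    return (in2, in1) if c else (in1, in2)


def y_switch(in1, in2, c):
    """Y-switching block: 2-input multiplexer selecting in2 if c = 1, else in1."""
    return in2 if c else in1


# Programming the same universal gate as AND or as XOR only changes the control bits.
AND = (0, 0, 0, 1)
XOR = (0, 1, 1, 0)
```

With four control bits, a universal gate can take on any of the 2^4 = 16 possible 2-input gate types, whereas each switching block needs a single control bit.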

Valiant's 2-way UC Construction
We described in Sect. 3.1 that a $U_n(\Gamma_\rho)$ EUG can be constructed of $\rho$ instances of $U_n(\Gamma_1)$ EUGs. Valiant [66] provides an EUG for $\Gamma_1(n)$ graphs, two of which can build an EUG for $\Gamma_2(n)$ graphs, which suffices for circuits with 2-input gates that have at most two outgoing wires. Let $P = \{p_1, \ldots, p_n\}$ be the set of poles in $U_n(\Gamma_1)$ that have indegree and outdegree 1, corresponding to the inputs, gates and outputs of the [...] $\ldots, p_n\}$ to the outputs. The main, so-called body block $B^{(2)}$ used for constructing Valiant's EUG $U^{(2)}_n(\Gamma_1)$ for $\Gamma_1(n)$ graphs, which has size $\sim 2.5n\log_2 n$, is shown in Fig. 3 and consists of 2 poles (large circles), 4 so-called recursion points (rectangles), and 3 additional nodes (small circles). The corresponding UC has twice the size, $\sim 5n\log_2 n$, since it corresponds to an EUG for $\Gamma_2(n)$ graphs. This construction is called the 2-way EUG or UC construction since there are two sets of recursion nodes at each recursion step, as we describe below.
The recursive construction works as follows: The rectangles are special nodes that build up the set of poles in the next recursion step, i.e., $R^1_{\lceil n/2 \rceil - 1}$ [...], such that we have four subgraphs at the next level, etc. The blocks are chained together at the recursion points to form a skeleton, i.e., each recursion point belongs to two [...] in the corresponding subgraph. Thus, the main skeleton of the UC consists of $\lceil n/2 \rceil$ such blocks with poles $\{p_1, p_2, \ldots, p_n\}$, and the next two skeletons consist of [...]. We note that the top (resp. bottom) block of a skeleton does not need the upper (resp. lower) recursion points since its poles are the inputs (resp. outputs) in the block. Therefore, we presented optimized so-called head $H^{(2)}$ and tail $T^{(2)}$ blocks that occur at the top and bottom of a skeleton, respectively, in [31, Fig. 2b-e].
Proof of Theorem 4 [Val76]. We recapitulate the proof from [66] that $U^{(2)}_n(\Gamma_1)$ is edge-universal for $\Gamma_1(n)$, i.e., that any graph with n nodes and fanin and fanout 1 can be edge-embedded into $U^{(2)}_n(\Gamma_1)$. According to the definition of edge-embedding, it has to be shown that given any $\Gamma_1(n)$ graph $G = (V, E)$, for any $(i, j) \in E$ and $(k, l) \in E$ we can find pairwise edge-disjoint paths from $p_i$ to $p_j$ and from $p_k$ to $p_l$ in $U^{(2)}_n(\Gamma_1)$. As before, the labeling of nodes $V = \{1, \ldots, n\}$ in G is according to a topological order of the nodes.
Firstly, each two neighboring poles of the EUG, $p_{2s-1}$ and $p_{2s}$ for $s \in \{1, \ldots, \lceil n/2 \rceil\}$, are thought of as merged poles, so-called superpoles, with their fanin and fanout becoming 2. In a similar manner, any graph $G \in \Gamma_1(n)$ can be regarded as a $\Gamma_2(\lceil n/2 \rceil)$ graph with supernodes, i.e., each pair $(2s-1, 2s)$ is merged into one node of a $\Gamma_2(\lceil n/2 \rceil)$ graph $G' = (V', E')$. If there is an edge between the two nodes of a pair in G, it is simulated with a loop. The set of edges of this graph $G'$ is partitioned into disjoint sets $E_1$ and $E_2$, such that [...], respectively. This can be done efficiently, as shown in Theorem 1. The edges in $E_1$ are embedded as directed paths in $R^1_{\lceil n/2 \rceil - 1}$, and the edges in $E_2$ as directed paths in $R^2_{\lceil n/2 \rceil - 1}$. Both $E_1$ and $E_2$ have at most one edge directed into and at most one directed out of any supernode, and therefore, there is only one edge from $E_1$ and one from $E_2$ to be simulated going through any superpole in $U^{(2)}_n(\Gamma_1)$ as well. Thus, the edge coming into a superpole $(p_{2s-1}, p_{2s})$ in $E_1$ is embedded as a path through $r^1_{s-1}$, while the edge going out of the superpole in $E_1$ is embedded as a path through $r^1_s$ in the appropriate subgraph. Similarly, the edges in $E_2$ are simulated as edges through $r^2_{s-1}$ and $r^2_s$. These paths can be chosen disjoint according to the induction hypothesis. Finally, the paths from $r^1_{s-1}$ and $r^2_{s-1}$ to superpole $(p_{2s-1}, p_{2s})$ as well as the paths from $(p_{2s-1}, p_{2s})$ to $r^1_s$ and $r^2_s$ can be chosen edge-disjoint due to the skeleton built up of the body blocks shown in Fig. 3. With this, Valiant's graph construction results in a valid EUG with asymptotically optimal size $O(n \log n)$ and depth $O(n)$ [66]. With the building blocks described in Sect. 3.2, it is easy to see that the resulting Boolean circuit is universal.

Implementation.
We provided an open-source implementation of this 2-way UC optimized for PFE in [45]. In concurrent and independent related work, Lipmaa et al. [46] also showed the practicality of Valiant's 2-way UC. They decrease its total number of gates compared to that of Valiant's block (Fig. 3) by one XOR gate. However, the number of AND gates is exactly the same, and therefore, their improvement does not affect PFE using UCs, where XOR gates are evaluated for free [44].

Valiant's 4-way UC Construction
Similarly to the 2-way EUG construction (cf. Sect. 3.3), Valiant provides a more efficient 4-way EUG or UC construction [66] for $\Gamma_1(n)$ graphs, which can be extended to an EUG for $\Gamma_2(n)$ graphs by utilizing two instances $U^{(4)}_n(\Gamma_1)_1$ and $U^{(4)}_n(\Gamma_1)_2$ as described in Sect. 3.1. $U^{(4)}_n(\Gamma_1)$ has a 4-way recursive structure, i.e., at each recursion step [...] (Fig. 5a). The recursion base is the same as for the 2-way UC construction described in Sect. 3.1. This construction results in UCs of smaller size $\sim 4.75n\log_2 n$ but has a more complicated structure and programming algorithm. We have studied and implemented this universal circuit in [31] and recapitulate our results here and in Sect. 7. Valiant offers the main, so-called body block $B^{(4)}$ consisting of 4 poles (large circles), 15 nodes (small circles) as well as 8 recursion points (rectangles) shown in Fig. 5a. As before, we provide so-called head $H^{(4)}$ and tail $T^{(4)}$ blocks that occur at the top and bottom of a skeleton, respectively, in [31]. The blocks are connected such that the 4 top (resp. bottom) recursion points of one block are the 4 bottom (resp. top) recursion points of the next block. Similarly to the 2-way EUG, 4 sets are created for n nodes, i.e., $R^1$ [...]
Recently, Zhao et al. [72] optimized the body block of Valiant's UC by finding a more efficient block using exhaustive search over all possible blocks. As opposed to Valiant's UC that uses 15 additional nodes in the body block, their block uses only 14 additional nodes, and therefore, their UC achieves an asymptotically better size of $\sim 4.5n\log_2 n$. We depict the further optimized body block $B^{(4)}$ of Zhao et al. in Fig. 5b. Zhao et al. provide a computer-generated proof that this block can indeed be used to construct universal circuits. Moreover, they show that there exists no block with only 13 additional nodes that can be used to construct UCs in the same manner. This proves that the minimal size of a 4-way UC constructed in this manner is the achieved $\sim 4.5n\log_2 n$.
The proof of this theorem is analogous to that of Theorem 4.

Programming Valiant's Universal Circuits
We designed the detailed embedding algorithm and the open-source UC implementation of [45] specifically for the 2-way UC, dealing with the whole UC skeleton as one block. In contrast, based on the modular design of [46], we modularized the edge-embedding task into multiple subtasks and described how they can be performed separately in [31]. In this section, we detail this modular approach for edge-embedding a graph into Valiant's $\kappa$-way EUG, where $\kappa = 2$ or $\kappa = 4$: The edge-embedding can be split into two parts, which are then combined.
In the following, we describe the two main steps of our modular approach presented in [31] that are based on the edge-embedding algorithm of [45]. (1) Block edge-embedding (Sect. 4.1) allows for the programming of the blocks visualized in Fig. 3 on p. 12 and in Figs. 5a or b on p. 14. (2) Recursion point edge-embedding (Sect. 4.2) takes care of the programming of the whole UC. Here, the paths are defined and the necessary information is provided to the blocks. The process can be generalized to any $2^i$-way EUG. Moreover, the same modular edge-embedding algorithm can be applied with a few modifications for Lipmaa et al.'s generalization to any k-way UC [46], which we describe later in Sect. 5.1.

Block Edge-Embedding
We consider the top (resp. bottom) recursion points of a block (Figs. 3 and 5a or b) as intermediate nodes where the inputs (resp. outputs) of the block enter (resp. exit). The blocks are built so that any of these inputs can be forwarded to exactly one of the poles of the block and the output of any pole can be forwarded to an output or another pole with a higher topological order.
We formalize this behavior as follows: In $U^{(\kappa)}$ [...] The value 0 of the input and output vectors is a dummy value which is used if there is no specific path between an input and a pole, or between a pole and an output of $B^{(\kappa)}$. The output vector has a larger value range, since a pole can be forwarded to another pole or an output recursion point. Therefore, we use values $1, \ldots, \kappa-1$ for poles $p_{i+2}, \ldots, p_{i+\kappa}$ and values $\kappa, \ldots, 2\kappa-1$ for the output recursion points. Pole $p_{i+1}$ cannot be a destination for a path in $B^{(\kappa)}$, since $\eta_U(p_{i+1})$ is less than the topological order of any other pole in $B^{(\kappa)}$. Additionally, the values of in and out need to be pairwise different or 0. Every combination of input and output vector covering the conditions formalized below in Eqs. 2-6 is valid for $B^{(\kappa)}$. A pair $(r^l_i, p_j) \in P$ or $(p_j, r^l_{i+1}) \in P$ is a path from $r^l_i$ to $p_j$ or from $p_j$ to $r^l_{i+1}$ in the set of all paths P in $B^{(\kappa)}$. Then, $P^{(\kappa)}_B \subseteq P$ denotes the paths that are to be edge-embedded (cf. Sect. 3.1). PolePolePath: [...] PoleOutPath: [...] InDiff: [...]
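The value ranges described above can be illustrated with a small validity check. This is only a sketch under stated assumptions: the function name, the list encoding of the vectors and the range 0..k of the input vector are hypothetical, and the check covers only the constraints spelled out in the prose, not the full set of conditions of Eqs. 2-6.

```python
def valid_block_vectors(in_vec, out_vec, k):
    """Check the constraints stated in the text for one block with k poles.

    in_vec[l]  : pole (1..k) reached from input recursion point l+1, or 0 (dummy).
                 The range 0..k is an assumption.
    out_vec[j] : destination of pole j+1: 0 (dummy), 1..k-1 (poles 2..k of the
                 block), or k..2k-1 (output recursion points).
    """
    if len(in_vec) != k or len(out_vec) != k:
        return False
    if any(not 0 <= v <= k for v in in_vec):
        return False
    if any(not 0 <= v <= 2 * k - 1 for v in out_vec):
        return False
    # Non-dummy entries must be pairwise different within each vector.
    for vec in (in_vec, out_vec):
        used = [v for v in vec if v != 0]
        if len(used) != len(set(used)):
            return False
    # A pole may only forward to a pole with a higher topological order:
    # value v in 1..k-1 addresses pole v+1, the source pole is j+1.
    for j, v in enumerate(out_vec):
        if 1 <= v <= k - 1 and v + 1 <= j + 1:
            return False
    return True
```

For example, in a 2-pole block the pair `in_vec = [1, 2]`, `out_vec = [1, 2]` passes (pole 1 forwards to pole 2, pole 2 to the first output recursion point), whereas `out_vec = [0, 1]` fails because pole 2 would forward to itself.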

Recursion Point Edge-Embedding
Block edge-embedding covers only the programming of the nodes within the blocks of the UC. Another task is to program the recursion points. We use the construction of [45] which, in every step, splits a $\Gamma_2(n)$ graph in two $\Gamma_1(n)$ graphs, which are merged to two $\Gamma_2(\lceil n/2 \rceil - 1)$ graphs. This, as described later, results in a tree of graphs with fanin and fanout one or two called supergraph [45]. We use this supergraph for defining the paths in Valiant's 2-way EUG. For Valiant's 4-way EUG, we use every second step of the algorithm with a minor modification. We describe our modular algorithm for the 2-way and 4-way UCs below and in Listing 1.
Let $C^k_{u,v}$ be the Boolean circuit computing function f that our UC needs to compute and $G \in \Gamma_2(n)$ its graph representation (cf. Sect. 2.2).
1. Splitting $G \in \Gamma_2(n)$ in two $\Gamma_1(n)$ graphs $G_1$ and $G_2$: As described in Sect. 3.1, Valiant's UC is derived from an EUG for $\Gamma_2(n)$ graphs, which is built up of two EUGs $(U^{(\kappa)}_n(\Gamma_1))_1$ and $(U^{(\kappa)}_n(\Gamma_1))_2$ for $\Gamma_1(n)$ graphs merged by their poles. G is similarly split into two $\Gamma_1(n)$ graphs $G_1$ and $G_2$, which then need to be edge-embedded into $(U^{(\kappa)}$ [...] is split by 2-coloring its edges [45,66], which can always be done due to Kőnig's theorem [38,48] recapitulated in Theorems 1 and 2 on p. 7-8. After 2-coloring, E is divided into sets $E_1$ and $E_2$, using which we build $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$, with the following conditions: [...]
In an EUG, the number of poles decreases in each recursion step and merging a $\Gamma_1(n)$ graph into a $\Gamma_2(\lceil n/2 \rceil - 1)$ graph provides information about the paths to be taken. Let [...] two nodes in $G_1$ are mapped to one node in $G_m$. At last, we define a mapping $\theta_E$ that maps an edge [...] and j < i, e is removed from $E'$, along with the last node $v_{\lceil n/2 \rceil}$ (due to the definition of $\theta_E$, it does not have any incoming edges). The resulting $G_m$ is a topologically ordered graph in $\Gamma_2(\lceil n/2 \rceil - 1)$.
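The 2-coloring used for the splitting step can be sketched as follows. This is not the implementation of [45]; it is a standard alternating walk on the bipartite graph that has one vertex per "outgoing side" and one per "incoming side" of each node, in which every vertex has degree at most 2, so all components are paths or even cycles. The function name and the edge-list representation are assumptions.

```python
from collections import defaultdict


def two_color_edges(edges):
    """Split the edges of a graph with fanin and fanout <= 2 into two sets in which
    every node has at most one incoming and at most one outgoing edge."""
    incident = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        incident[("out", u)].append(idx)
        incident[("in", v)].append(idx)
    if any(len(es) > 2 for es in incident.values()):
        raise ValueError("fanin or fanout larger than 2")

    color = {}

    def walk(vertex, edge):
        # Follow the path or cycle starting at `vertex` over `edge`, alternating colors.
        c = 0
        while edge is not None and edge not in color:
            color[edge] = c
            u, v = edges[edge]
            # Cross the edge to its other endpoint in the bipartite graph.
            vertex = ("in", v) if vertex == ("out", u) else ("out", u)
            others = [e for e in incident[vertex] if e != edge]
            edge = others[0] if others else None
            c = 1 - c

    endpoints = list(incident.items())
    # First color all paths, starting from their degree-1 end vertices.
    for vertex, es in endpoints:
        if len(es) == 1:
            walk(vertex, es[0])
    # Whatever is still uncolored lies on (even-length) cycles.
    for vertex, es in endpoints:
        for e in es:
            walk(vertex, e)

    e1 = [edges[i] for i in range(len(edges)) if color[i] == 0]
    e2 = [edges[i] for i in range(len(edges)) if color[i] == 1]
    return e1, e2
```

Calling `two_color_edges` on the edge list of G yields the edge sets of $G_1$ and $G_2$; which of the two valid colorings is produced depends on the traversal order.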

The supergraph for Valiant's EUG construction.
In the first step, G is split into two $\Gamma_1(n)$ graphs $G_1$ and $G_2$. $G_1$ and $G_2$ contain all the edges that should be embedded as paths between poles in the first and second EUGs for $\Gamma_1(n)$, respectively. We now explain how to edge-embed the $\Gamma_1(n)$ graph $G_1$ into an EUG $U^{(\kappa)}_n(\Gamma_1)$ (for $G_2$ it is analogous). For edge-embedding in the 2-way EUG, $G_1$ is first merged to a $\Gamma_2(\lceil n/2 \rceil - 1)$ graph $G_m$. $G_m$ is then 2-colored and split into two $\Gamma_1(\lceil n/2 \rceil - 1)$ graphs $G^1_1$ and $G^2_1$ [45]. These get merged to two graphs $G^1_m$ and $G^2_m$, which are then 2-colored and split into two $\Gamma_1(\lceil (\lceil n/2 \rceil - 1)/2 \rceil - 1)$ graphs each. These steps are repeated until the recursion base is reached.
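The merging step can be sketched as below, assuming nodes are labelled 1..n in topological order and that an edge between blocks $b_i$ and $b_j$ is mapped to $(b_i, b_j - 1)$, which is how the next-level path is described for the edge-embedding. The function name and the edge-list representation are hypothetical; the parameter `k` is included so that the same sketch covers merging 2 or 4 nodes.

```python
def merge_gamma1(edges, n, k=2):
    """Merge every k consecutive nodes of a Gamma_1(n) graph into one supernode.

    An edge (i, j) is mapped to (ceil(i/k), ceil(j/k) - 1). Mapped edges whose
    target is smaller than their source connect two nodes of the same supernode
    and are dropped. Edges between neighboring supernodes become loops (s, s).
    Returns (number of nodes of the merged graph, list of merged edges).
    """
    merged = []
    for i, j in edges:
        src = (i + k - 1) // k          # ceil(i / k)
        dst = (j + k - 1) // k - 1      # ceil(j / k) - 1
        if dst < src:
            continue                    # handled inside one block
        merged.append((src, dst))
    return (n + k - 1) // k - 1, merged
```

For n = 8 and k = 2, the edges (2, 5) and (4, 8) map to (1, 2) and (2, 3) in a graph with 3 nodes, while (1, 2), (3, 4) and (5, 6) stay inside their blocks.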
In the supergraph, $G^{\psi 1}_1$ and $G^{\psi 2}_1$ are the first and second subgraphs of $G^{\psi}_1$ for any $\psi$, respectively.
In Valiant's 4-way EUG construction [66], a supergraph that creates 4 subgraphs in each step is necessary. We require a merging method where a $\Gamma_1(n)$ graph is merged to a $\Gamma_4(\lceil n/4 \rceil - 1)$ graph where 4 nodes build a new node, and 4-color this graph to retrieve 4 subgraphs. However, this can directly be solved by using the method described above from [45]: After repeating the 2-coloring and the merging twice, we gain 4 subgraphs ($G^{11}_1$, $G^{12}_1$, $G^{21}_1$ and $G^{22}_1$). These can be used as if they were the result of 4-coloring the graph obtained by merging every 4 nodes into one.
However, there is a modification in this case: The first 2-coloring is a preprocessing step, which does not map to an EUG recursion step. Therefore, we have to define another [...], since in this preprocessing step we need to keep node $v_{\lceil n/2 \rceil}$. Then the creation of the supergraph for the 4-way EUG construction works as follows: We merge $G_1$ to a $\Gamma_2(\lceil n/2 \rceil)$ graph with the labelings $\eta_{in}$ and $\eta_{out}$ and get $G_m$. After that, we split $G_m$ into two $\Gamma_1(\lceil n/2 \rceil)$ graphs $G^1_1$ and $G^2_1$. These get merged to $\Gamma_2(\lceil n/4 \rceil - 1)$ graphs $G^1_m$ and $G^2_m$ using the $\eta_{in}$ and $\eta_{out}$ labelings. Finally, these two graphs get split into 4 $\Gamma_1(\lceil n/4 \rceil - 1)$ graphs $G^{11}_1$, $G^{12}_1$, $G^{21}_1$, and $G^{22}_1$. These are the relevant graphs for the first recursion step in Valiant's 4-way EUG construction. Then we continue for all 4 subgraphs until we reach the recursion base.
Listing 1. Edge-embedding algorithm for Valiant's $\kappa$-way EUG.
  [...]
  Let S be the set of the $\Gamma_1$ subgraphs of $G_1$ in the supergraph
  Let R be the recursion step graphs
  Let B be the set of blocks in U
  for [...]
    Let $i'$ and $j'$ denote the positions of $v_i$ and $v_j$ in their blocks
    [...]
    Set the control bit of $r^x_0$ to 1
    [...]
    Set the control bit of $r^x_1$ to y
    [...]
$\kappa$-way Edge-Embedding Algorithm. In Listing 1, we combine block edge-embedding and recursion point edge-embedding.
Let U denote the part of $U^{(\kappa)}_n(\Gamma_1)$ without recursion steps (the main skeleton) and [...]. A recursion step graph of U is one of the graphs having one of the sets of recursion points as poles (e.g., $r^1_1, \ldots, r^1_{\lceil n/\kappa \rceil - 1}$) without the recursion steps. R denotes the set of all recursion step graphs of U, and B denotes the set of all blocks in U.
We give a brief explanation of Listing 1 that describes the edge-embedding process. For any edge $e = (v_i, v_j) \in E$ in $G_1$, $b_i$ and $b_j$ denote the numbers of the blocks that contain $v_i$ and $v_j$. We distinguish between two cases:
Case 1. $v_i$ and $v_j$ are in the same block: $b_i = b_j$. The edge-embedding is solved within the block, and no recursion points have to be programmed for the path.
Case 2. [...] Therefore, [...] is not yet used for an edge-embedding. This determines that the path in the next recursion step has to be between poles $p_{b_i}$ and $p_{b_j - 1}$. We denote with $s \in S$ the subgraph of $G_1$ which contains e, and x denotes its number in S, i.e., $S[x] = s$. This determines in which of the recursion step graphs we need to edge-embed the path from $p_{b_i}$ to $p_{b_j - 1}$, and thus which recursion points we need to program. We first set the control bit of the xth input (resp. output) recursion point to 1, since the path between the poles with labeling i and j enters (resp. exits) the next recursion step over this recursion point. A special case to be considered here is when blocks $B[b_i]$ and $B[b_j]$ are neighbors (i.e., $b_j = b_i + 1$). Then, the path enters and leaves the next recursion step graph at the same node, whose control bit thus has to be 0. In the output vector of block $B[b_i]$, the $i'$th value is set to the xth recursion point, and in the input vector of block $B[b_j]$, the xth value is set to the $j'$th pole of this block.
We repeat these steps for all edges $e \in E$. Since the input and output vectors of all blocks in B are then set, the blocks can be embedded with the block edge-embedding. For all subgraphs of $G_1$ in the supergraph and in the EUG, we call the same procedure with [...]
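The case distinction described above can be sketched for one $\Gamma_1$ subgraph as follows. Listing 1 is the authoritative description; the sketch only records what the prose states. The function name, the dictionary-based bookkeeping and the encoding of vector entries (values 1..k-1 for poles, k..2k-1 for output recursion points, recursion point b lying between blocks b and b+1) are assumptions.

```python
def embed_edges(edges, x, k):
    """Fill block vectors and recursion-point bits for the edges of one subgraph.

    edges : list of (i, j), 1-based topologically ordered node labels with i < j
    x     : 1-based number of the subgraph in S (selects the x-th recursion points)
    k     : number of poles per block (2 or 4)
    Returns (in_vec, out_vec, rec_bit, next_edges); blocks are numbered from 0,
    recursion points and next-level poles from 1.
    """
    in_vec, out_vec, rec_bit, next_edges = {}, {}, {}, []
    for i, j in edges:
        b_i, pos_i = divmod(i - 1, k)      # block number and position of v_i
        b_j, pos_j = divmod(j - 1, k)      # block number and position of v_j
        if b_i == b_j:
            # Case 1: pole-to-pole path inside the block (values 1..k-1).
            out_vec[(b_i, pos_i)] = pos_j
            continue
        # Case 2: leave block b_i and enter block b_j over the x-th recursion points.
        out_vec[(b_i, pos_i)] = k + x - 1  # values k..2k-1 address output points
        in_vec[(b_j, x - 1)] = pos_j + 1   # 1-based pole position inside the block
        if b_j == b_i + 1:
            rec_bit[(x, b_j)] = 0          # path enters and leaves at the same point
        else:
            rec_bit[(x, b_i + 1)] = 1      # entry into the next recursion step
            rec_bit[(x, b_j)] = 1          # exit from the next recursion step
            next_edges.append((b_i + 1, b_j))
    return in_vec, out_vec, rec_bit, next_edges
```

The returned `next_edges` are the paths that remain to be embedded in the x-th recursion step graph, on which the same procedure would be called again.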

Extensions to Valiant's UC Constructions
Here, we describe ideas for novel UC constructions and implementations. Firstly, in Sect. 5.1, we describe the k-way generalization of Valiant's UC presented by Lipmaa et al. in [46]. In Sect. 5.2, we describe our modular building blocks for a potentially more efficient 3-way UC. We show that Valiant's optimized $U_3(\Gamma_1)$ cannot directly be applied as a building block in the construction, because it must have an additional node to be part of a generic EUG. We prove that the EUG without this node is not a valid EUG by showing a counterexample. Therefore, the 3-way UC actually results in a worse asymptotic size than Valiant's 2-way and 4-way UCs [66]. Thereafter, in Sect. 5.3, we propose a hybrid UC, utilizing both Valiant's 2-way and 4-way UCs or Valiant's 2-way and Zhao et al.'s 4-way UC [72], so that the overall size of the resulting hybrid UC is minimized and is at least as efficient as the better construction for the given size (in Sect. 6.2 we show its concrete improvement). Finally, in Sect. 5.4, we propose a different modular and scalable approach to Valiant's 4-way UC. This approach requires substantial modifications in the UC generation and programming algorithms, but can be generalized to any k-way UC or to our hybrid UC.

Generalized k-way UC
In [46], Lipmaa et al. generalize Valiant's approach by providing a UC with any number of recursion points k, the so-called k-way EUG or UC. We note that their construction slightly differs from Valiant's EUG, since they do not consider the restriction on the fanout of the poles, i.e., the nodes in the EUG that correspond to universal gates or inputs (cf. Sect. 3.1). This optimization has also been included in [45] when translating an EUG to a UC, but including it in the block design leads to better sizes for the number of XOR gates. This, however, does not make a difference in case of our most prominent application of private function evaluation (PFE) (cf. Sect. 1.1), where XOR gates are free, i.e., do not require cryptographic operations and communication.
The idea is to split $n = u + v + g$ in $m = \lceil n/k \rceil$ blocks as shown in Fig. 6. Every block i consists of k inputs $r^1_i, r^2_i, \ldots, r^k_i$ and k outputs $r^1_{i+1}, r^2_{i+1}, \ldots, r^k_{i+1}$ as well as k poles, except for the last block, which has a number of poles depending on n mod k. For every $j \le k$, the list of all $r^j_i$ builds the poles of the jth subgraph of the next recursion step, i.e., we have k subgraphs. Additionally, every block begins and ends with a Waksman permutation network [67] such that the inputs and outputs can be permuted to any pole. A Y-switching block is placed in front of every pole $p_i$, which is connected to the ith output of the permutation network as well as the ith output of a block-internal EUG $U_k(\Gamma_1)$. This means that Lipmaa et al. [46] reduce the problem of finding an efficient block $B^{(k)}$ for the k-way EUG $U^{(k)}_n(\Gamma_2)$ to the problem of finding the smallest EUG $U_k(\Gamma_1)$. Their solution is to build the block-internal EUG with the UC of [44], which was claimed to be more efficient for smaller circuits than [66]. Moreover, they calculate the optimal k value to be around 3.147 with their construction, which implies that the best solutions are found using small EUGs, for which Valiant provides hand-optimized solutions (i.e., for k = 2, 3, 4, 5, 6) [66].
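As a small illustration of the block layout, the following sketch partitions the n poles into blocks of k and reports how many poles each of the k subgraphs of the next recursion step has. The function names are hypothetical, and the count of $\lceil n/k \rceil - 1$ recursion points per subgraph is taken from the size of the merged graph used for the edge-embedding.

```python
def k_way_blocks(n, k):
    """Split poles 1..n into ceil(n/k) consecutive blocks; the last may be smaller."""
    return [list(range(start, min(start + k, n + 1))) for start in range(1, n + 1, k)]


def subgraph_pole_count(n, k):
    """Number of poles of each of the k subgraphs in the next recursion step."""
    return len(k_way_blocks(n, k)) - 1
```

For n = 10 and k = 3 this gives four blocks, the last one holding only pole 10, and three recursion points per subgraph.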
We note that the results recently presented by Zhao et al. [72] do not fit into this generalized k-way construction. Therefore, Zhao et al.'s optimized 4-way block is an optimization over Valiant's modular 4-way block construction [66].

Programming the Generalized UC
In this section, we extend the recent work of [46] by providing a detailed and modular embedding mechanism for any k-way EUG construction. We provide the main differences to the edge-embedding of the 2-way and 4-way EUG detailed in Sect. 4. [...] Splitting [...] in two $\Gamma_1(n)$ graphs $G_1$ and $G_2$: Similarly as in Sect. 4.2, we first split G into two $\Gamma_1(n)$ graphs $G_1$ and $G_2$ with 2-coloring. [...]
The mapping of the edges $\theta_E$ is the same as in the 2-way and 4-way EUG construction, and edges $(v_i, v_j) \in E$ with j < i are removed along with $v_{\lceil n/k \rceil}$ in the end. $G_m$ is then a topologically ordered graph in $\Gamma_k(\lceil n/k \rceil - 1)$.

The supergraph for Lipmaa et al.'s k-way EUG construction.
The next step of the construction is to split [...]
According to Kőnig's theorem [38,48] described in Sect. 2.1, k (n) graphs can always be k-colored efficiently with a dedicated algorithm. The rest of the supergraph construction and the way it is used for edge-embedding is the same as for the 2-way and 4-way EUG as described in Sect. 4.2.
k-way Edge-Embedding Algorithm. The edge-embedding algorithm is the same as shown in Listing 1, with $\kappa = k$.

Potentially More Efficient 3-Way UC
The optimal k value for minimizing the size of the k-way UC was calculated to be 3.147 in [46]. We describe our idea of a 3-way UC. Intuitively, based on an optimization by Valiant [66], this UC should result in the best asymptotic size. The asymptotic size of any k-way UC depends on the size of its modular body block $B^{(k)}$ (e.g., Fig. 5a or b on p. 14 for the 4-way UC). Once it is determined, the size of the UC is
$$\mathrm{size}(U^{(k)}_n(\Gamma_2)) = 2 \cdot \mathrm{size}(U^{(k)}_n(\Gamma_1)) \sim 2 \cdot \frac{\mathrm{size}(B^{(k)})}{k}\, n \log_k n = 2 \cdot \frac{\mathrm{size}(B^{(k)})}{k \log_2 k}\, n \log_2 n.$$
The modular block consists of two permutation networks $P^{(k)}$, an EUG $U_k(\Gamma_1)$, and $(k-1)$ Y-switching blocks (cf. Sect. 5.1, [46]).
Size of Body Block $B^{(3)}$ with Valiant's Optimized $U_3(\Gamma_1)$. According to Valiant [66], an EUG $U_3(\Gamma_1)$ with 3 poles contains only three connected poles (used as recursion base in Sect. 3.1). An optimal permutation network $P^{(3)}$ that achieves the lower bound has 3 nodes as well. This implies that $\mathrm{size}(B^{(3)}) = 2 \cdot \mathrm{size}(P^{(3)}) + \mathrm{size}(U_3(\Gamma_1)) + (3 - 1) = 11$. Then, the size of the UC becomes $\sim 2 \cdot \frac{11}{3 \log_2 3}\, n \log_2 n \approx 4.627\, n \log_2 n$, which is asymptotically around 2.5% smaller than the size of Valiant's 4-way UC with $\sim 4.75n\log_2 n$.
However, there is a flaw in this initial design. Valiant's $U_3(\Gamma_1)$ only works as an EUG for 3 nodes under special conditions, e.g., when it is a subgraph within a larger EUG. There are 3 possible edges in a topologically ordered graph $G = (V, E)$ in $\Gamma_1(3)$: (1, 2), (2, 3) and (1, 3). (1, 2) and (2, 3) can be directly embedded in $U_3(\Gamma_1)$ using $(p_1, p_2)$ and $(p_2, p_3)$, respectively. (1, 3), however, has to be embedded as a path through node 2, i.e., as a path $((p_1, p_2), (p_2, p_3))$. When $U_3(\Gamma_1)$ is a subgraph of a bigger EUG, this is possible by programming $p_2$ accordingly. However, when we use this $U_3(\Gamma_1)$ as a building block in the body block of our EUG, it cannot directly be applied, because the programming of $p_2$ depends on other constraints as well. A generic $U_3(\Gamma_1)$ that can embed (1, 3) without going through $p_2$ as before has an additional Y-switching block between $p_2$ and $p_3$.
We depict in Fig. 7a the 3-way body block that uses Valiant's optimized $U_3(\Gamma_1)$ in the k-way block design of [46] and show that it is not a valid body block for an EUG construction. Assume that the output of pole $p_{3i+1}$ has to be directed to pole $p_{3i+3}$ (green path). Then, it needs to go through pole $p_{3i+2}$, which means that the red edge going to $p_{3i+2}$ is used by this path. However, there can be another edge coming from the permutation network as an input to $p_{3i+2}$, e.g., from $p_{3i}$ from the preceding block through $r^1_i$ (blue path). This cannot be directed to $p_{3i+2}$ anymore, as shown in Fig. 7a, since the red edge would carry two different values. Therefore, in the 3-way body block construction, it does not suffice to use Valiant's optimized $U_3(\Gamma_1)$ [66].
Size of Body Block $B^{(3)}$ with Our Generic $U_3(\Gamma_1)$. In Fig. 7b, we show the 3-way body block with the generic $U_3(\Gamma_1)$ that allows the output from $p_{3i+1}$ to be directed to $p_{3i+3}$ without having to go through $p_{3i+2}$ (green path), and the edge going into $p_{3i+2}$ can be utilized by the path directed into this node (blue path). This results in $\mathrm{size}(B^{(3)}) = 2 \cdot \mathrm{size}(P^{(3)}) + \mathrm{size}(U_3(\Gamma_1)) + (3 - 1) = 12$, which implies that the size of the UC is $\sim 2 \cdot \frac{12}{3 \log_2 3}\, n \log_2 n \approx 5.047\, n \log_2 n$. Unfortunately, this is even worse than the size of the 2-way UC with $\sim 5n\log_2 n$, and we therefore conclude that the most efficient known UC is Valiant's 4-way UC with Zhao et al.'s optimization.
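The leading coefficients quoted for the different constructions can be re-derived from the body block sizes with a few lines of Python. The function and dictionary names are hypothetical. The block sizes 5, 19 and 18 for the 2-way and 4-way blocks are an interpretation: they count the poles plus the additional nodes of each block (2 + 3, 4 + 15 and 4 + 14), which reproduces the coefficients 5, 4.75 and 4.5 stated in the text.

```python
from math import log2


def leading_coefficient(block_size, k):
    """Coefficient c in size(U_n^(k)(Gamma_2)) ~ c * n * log2(n)."""
    return 2 * block_size / (k * log2(k))


# (block size, k) per construction; 2-way/4-way sizes are poles + additional nodes.
CONSTRUCTIONS = {
    "2-way (Valiant)": (5, 2),
    "4-way (Valiant)": (19, 4),
    "4-way (Zhao et al.)": (18, 4),
    "3-way, optimized U_3": (11, 3),
    "3-way, generic U_3": (12, 3),
}

if __name__ == "__main__":
    for name, (size, k) in CONSTRUCTIONS.items():
        print(f"{name}: ~{leading_coefficient(size, k):.3f} n log2 n")
```

The comparison makes the conclusion of this section visible at a glance: the 3-way block with the generic $U_3(\Gamma_1)$ ends up above the 2-way coefficient, while the 4-way blocks stay below it.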
Recently, Zhao et al. [72] have confirmed by exhaustive search over all possible topologies that the 3-way body block $B^{(3)}$ presented in Fig. 7b results in the smallest 3-way UC: no block with only 11 additional nodes can be used as a universal block, whereas our block with 12 additional nodes can.

2/4 Hybrid UC Construction
In this section, we detail our hybrid UC based on Valiant's 2-way and 4-way UCs with the optimization by Zhao et al. [72], which yields the smallest UCs to date. Given the size of the input circuit $C^g_{u,v}$, i.e., $n = u + v + g$, we can calculate at each recursion step whether it is better to create 2 subgraphs of size $\lceil n/2 \rceil - 1$ and utilize the 2-way recursive skeleton, or whether it is more beneficial to create a 4-way recursive skeleton with 4 subgraphs of size $\lceil n/4 \rceil - 1$. We assume that for every n, we have an algorithm that computes the size (i.e., $\mathrm{size}(U^{\mathrm{hybrid}(K)}_n(\Gamma_1))$) of the hybrid UC for sizes smaller than n. We give details on how it is computed in Sect. 6. Then, Listing 2 describes the algorithm for constructing a hybrid UC, choosing at each step the strategy that is more efficient. We note that our hybrid construction is generic, and given multiple k-way UCs as parameter K (K = {2, 4} in our example), it minimizes the concrete size of the resulting UC.
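The per-step choice can be illustrated with a memoized recursion over a deliberately simplified cost model. This is not the size computation of Sect. 6: the model charges one full body block per k poles (5 nodes for the 2-way block, 18 for the 4-way block of Zhao et al., both counting poles plus additional nodes), ignores the cheaper head and tail blocks, and uses a hypothetical recursion base in which graphs with at most 4 poles cost one node per pole. All names are assumptions.

```python
from functools import lru_cache

BLOCK_SIZE = {2: 5, 4: 18}   # simplified: nodes per body block, poles included
BASE = 4                     # hypothetical recursion base: at most 4 poles


@lru_cache(maxsize=None)
def model_size(n, ways=(2, 4)):
    """Size of a U_n(Gamma_1)-like structure under the simplified cost model,
    choosing at every recursion step the cheapest k from `ways`."""
    if n <= BASE:
        return max(n, 0)
    best = None
    for k in ways:
        blocks = -(-n // k)                       # ceil(n / k)
        cost = BLOCK_SIZE[k] * blocks + k * model_size(blocks - 1, ways)
        best = cost if best is None else min(best, cost)
    return best
```

Calling `model_size(n)` gives the hybrid value, while `model_size(n, (2,))` and `model_size(n, (4,))` give the pure 2-way and 4-way values under the same model. By construction, the hybrid value is never larger than either of them, which mirrors the guarantee stated for the hybrid UC.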

Scalable 4-way UC Construction
Our existing implementations of [31,45] store the whole UC of size $O(n \log n)$ in memory, which therefore becomes a bottleneck when it comes to scalability. In this section, we present the design of our scalable universal circuit construction. Specifically, we show how Valiant's 4-way UC can be modified to use $O(n)$ memory in the input circuit size n at each step of the execution. We note that our approach is generic, and with additional implementation effort, it can be extended to any k-way UC as well as to the 4-way UC of Zhao et al. [72].
Our design utilizes two separate phases. The first phase is scalable UC generation (Sect. 5.4.1), where the universal circuit is generated given the size n of the input circuit. This is solved by generating the topologically ordered UC layer by layer, each of which has size $O(n)$. The output of this step is a set of circuit files, each of which contains a subgraph of size $O(n)$, which helps to significantly reduce the complexity of the second phase, i.e., scalable UC programming (Sect. 5.4.2). In this step, the subcircuits resulting from the first phase are programmed individually, i.e., we proceed subcircuit by subcircuit instead of edge by edge of the input circuit as before. Therefore, the output of this step is a set of programming files that contain the programming bits respective to the circuit files. In Sect. 7.2, we will show experimentally that our scalable UC construction significantly reduces the memory usage.
Fig. 8. [...], where further subgraphs are created. We note that the nodes are shown only for one of the four subgraphs, but they are the same for all four subgraphs. Scalable head and tail blocks are designed analogously.

Scalable Per-Block UC Generation
The underlying idea behind our scalable UC generation is to generate the blocks of the main skeleton one by one, only keeping one such block and its corresponding subgraph nodes in memory at once. In this scenario, these blocks will be regarded as layers. Additionally, we store some necessary information from the preceding three layers in dedicated files, but delete these as soon as they become redundant. The required additional information is the topological order of nodes that are already defined and have edges directed into the current layer. Since the number of subgraphs in any layer is $O(n)$ and each subgraph contributes only a constant number of nodes to a layer, the number of nodes held in memory at any point is $O(n)$ as well.
Our scalable UC generation relies on the fact that at each block of the main skeleton, based on the modulo 4 result for each next recursion step, we know which part of the next subgraph skeleton or potentially recursion base graph we build at each layer. This observation helps us reconstruct what the subgraphs look like for a given body block in Valiant's 4-way UC. Since this structure is complicated and there are many cases to consider, we show in Fig. 8 the cases for Valiant's body block from Fig. 5a on p. 14 [66] and note that head and tail blocks can be constructed analogously. Moreover, a similar scalable design can be constructed for Zhao et al.'s body block (Fig. 5b) [72]. Figure 8d shows a recursive block construction with Figs. 8b, c being base cases. From [...] Fig. 8, each body block construction type is denoted by $B_i$, where $i \in \{0, 1, 2, 3\}$ is the [...] In the following, we use an example to detail how our scalable UC generation works. We depict the resulting UC files and what their content is in Table 1.
Generation of first (main) skeleton. Generating the first (main) skeleton of the two $U_n(\Gamma_1)$ EUGs that are merged into a $U_n(\Gamma_2)$ EUG differs from the next, recursive steps. Let us consider an example of a DAG with n = u + k + v = 36. Ideally, our approach constructs the same block twice, from the left and right $U_n(\Gamma_1)$ EUGs. In this scenario for $U_n(\Gamma_1)$, we have one (merged) head block H, seven (merged) body blocks B, and one (merged) tail block $T_4$ with 4 nodes in the main skeleton. Constructing the first head block is straightforward according to [31, Fig. 4e] as we do not have to construct any subgraph. Thereafter, we construct seven body blocks according to Fig. 5a and a tail block according to [31, Fig. 4f]. However, these merged blocks require constructing the subgraph nodes in the same layer alongside them, as we describe next. Note that in this first step, we actually generate the four sets of subgraph nodes twice, since the two $U_n(\Gamma_1)$ EUGs are merged into a $U_n(\Gamma_2)$ EUG (cf. Sect. 3.1), but in later recursion steps, only four sets of subgraph nodes are generated.
Generating subgraph nodes recursively per layer. We can generate the subgraph nodes recursively for all recursion steps at a given position for n nodes. In our example with n = 36, we only have a head and a tail block for the recursion graph with $\frac{n-4}{4} = 8$ poles. Therefore, we construct the first body block with $H_0$ as subgraph level, the second body block with $H_1$, thereafter $H_2$ and $H_3$. The fifth body block is constructed with $T_0$, the sixth and seventh with $T_1$ and $T_2$, respectively, and the tail block with $T_3$. Recursive scalable blocks are $H_3$ and $B_3$ as shown in Fig. 8d. $T_3$ does not have recursion points anymore, since a tail block has no output recursion points. For n = 8, we reach a recursion base with $\frac{n-4}{4} = 1$. However, for a larger n, more recursion steps might be necessary. Therefore, at each layer, we generate all subgraph nodes necessary, and if a recursion step, i.e., $H_3$ or $B_3$, occurs, we generate the nodes of the next subgraph as well, etc. We denote the recursion bases by $R_1$, $R_2$, $R_3$, and $R_4$ with 1, 2, 3, and 4 nodes, respectively.
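The labelling in this example can be reproduced with a short sketch that lists, for the first recursion level only, which subgraph block type accompanies each main-skeleton block after the head block. It encodes one reading of the example (pole j of a subgraph lies in subgraph block j div 4 at offset j mod 4; the first subgraph block is a head, the last a tail); the function name and the string labels are assumptions, and deeper recursion levels and partially filled tail blocks are not modelled.

```python
def subgraph_layer_types(n):
    """Subgraph block type generated alongside each main-skeleton block after the head.

    Returns labels such as 'H0'..'H3', 'B0'..'B3', 'T0'..'T3', or a single
    recursion-base label 'R1'..'R4' if the subgraph has at most 4 poles.
    """
    main_blocks = -(-n // 4)            # ceil(n / 4) blocks in the main skeleton
    sub_poles = main_blocks - 1         # poles of each subgraph of the next step
    if sub_poles <= 0:
        return []
    if sub_poles <= 4:
        return [f"R{sub_poles}"]        # recursion base with 1..4 nodes
    sub_blocks = -(-sub_poles // 4)
    labels = []
    for j in range(sub_poles):
        block, offset = divmod(j, 4)
        if block == 0:
            kind = "H"
        elif block == sub_blocks - 1:
            kind = "T"
        else:
            kind = "B"
        labels.append(f"{kind}{offset}")
    return labels
```

For n = 36 the sketch yields the sequence used in the example, i.e., $H_0$ to $H_3$ for the first four body blocks and $T_0$ to $T_3$ for the remaining three body blocks and the tail block.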
With this, we have shown how to generate topologically ordered universal circuits using the file system and achieve a scalable algorithm for UC generation that stores at most $O(n)$ information in memory. Moreover, our approach requires $\sim 4.75n\log_2 n$ disk space to store the universal circuit as before, and additionally $O(n)$ extra storage space for every layer. However, we only store additional data for the prior three layers and delete any other stored data at each step. At the end of the UC generation, we can delete any additionally stored data. The maximum storage requirement for our algorithm is reached before deleting the additionally stored data for the last layer, since the size of the UC dominates the storage requirements and only a part of the UC has been generated at any earlier step.

Scalable UC Programming
As described in Sect. 5.4.1, we design our scalable UC generation such that each subgraph is written into a separate file. This is important to also allow the programming step to require only O(n) memory. It can be observed in Listing 1 on p. 17 that the recursion point edge-embedding algorithm inherently handles the UC subgraph by subgraph (cf. Sect. 4.2), and in turn calls the block edge-embedding for all blocks in a subgraph. We observe that each skeleton can be programmed based only on the information stored in the corresponding Γ_1 graph, and therefore, we can store the programming bits in a separate file for each subgraph in the same order as the nodes of the subgraph.
After reading a subgraph from its file resulting from the UC generation step detailed in Sect. 5.4.1, it is programmed as described in Listing 1. The embedding starts from the main skeleton in file f 0 and continues with f 1 , . . . , f 4 and g 1 , . . . , g 4 , etc., and results in the corresponding programming files p 0 , p 1 , . . . , p 4 and q 1 , . . . , q 4 , etc.

Size and Depth of UCs
In this section, we review the size and depth of the UCs considered in this article. When the edge-universal graph U^(k)_n(Γ_2) is translated into a UC, the first u poles are associated with inputs, the last v poles with outputs, and the g poles in between are realized with universal gates (cf. Eq. 1 on p. 11) whose programming is defined by the corresponding gates in the simulated circuit. The rest of the nodes of U^(k)_n(Γ_2) are translated into universal programmable (X and Y) switching blocks (cf. Fig. 2 on p. 11), whose programming is defined by the edge-embedding of the graph G of the circuit into U^(k)_n(Γ_2). Thus, when considering the sizes and depths of the UCs, we realize the nodes and poles as circuit building blocks and express the concrete and asymptotic sizes in the number of switches (X and Y) and universal gates (U) (cf. Sect. 3.2).
In Sect. 6.1, we recapitulate the asymptotic size and depth of Valiant's 2-way and 4-way UCs [66], i.e., UC_{Valiant-2} and UC_{Valiant-4}, respectively, of Zhao et al.'s 4-way UC UC_{Zhao et al.-4} [72], and of the smallest k-way UCs following Lipmaa et al.'s generalization [46]. Thereafter, in Sect. 6.2, we present optimizations that reduce the size (and potentially the depth as well) of UCs, regardless of which constructions were used for their generation. We revise the concrete sizes and depths of UC

Asymptotic Size and Depth of k-Way UCs
Lipmaa et al.'s k-way UC [46] is discussed briefly in Sect. 5.1 and is depicted in Fig. 6 on p. 19. They show that a k-way body block may consist of two permutation networks P^(k), an EUG for k nodes, i.e., U_k(Γ_1), and additionally, (k − 1) Y-switching blocks. In this section, we recapitulate the sizes in Table 2 and depths in Table 3 of these building blocks and give an estimate for the leading constant for Lipmaa et al.'s k-way EUGs and UCs with size O(n log_2 n) and depth O(n), for k ∈ {2, . . . , 8}. We conclude that among all UCs following this generalization, the best size is achieved by Valiant's 4-way UC, UC_{Valiant-4}. This does not exclude the possibility of a more efficient UC, as has been shown in [72], where Zhao et al. propose a 4-way UC, UC_{Zhao et al.-4}, using a smaller body block. Therefore, their construction achieves the smallest asymptotic size to date. However, Zhao et al. state that their method cannot be used yet to find more efficient UCs for k > 4, since it includes an exhaustive search for which the domain becomes too large.

Edge-Universal Graph with k Poles
Size. Valiant optimized EUGs up to size 6 by hand in [66]: For k = 2, U_2(Γ_1) has two poles, and for k = 3, we discussed in Sect. 5.2 that an additional node is necessary. For k ∈ {4, 5, 6}, the sizes are {6, 10, 13}, as shown in [45, Fig. 1] (the nodes denoted as empty circles disappear in the UC). For k = 7 and k = 8, we observe that UC_{Valiant-2} results in a better size than UC_{Valiant-4} due to the smaller permutation network and fewer recursion nodes. Therefore, we use these constructions to compute the size of U_7(Γ_1) and U_8(Γ_1). As mentioned in [46], another possibility is to use the UC of [44] instead of these EUGs since they have better sizes for small circuits. These UCs U^KS08_k are built from two smaller U^KS08_{k/2}, a P^(k/2), and k/2 Y switches [44]. This results in a smaller size of 21 for k = 8.

Depth. The depth of the hand-optimized EUGs for k ∈ {2, 3, 4, 5, 6} is, respectively, {2, 4, 5, 7, 10} as shown in [45, Fig. 1]. The depth of U_7(Γ_1) and U_8(Γ_1) becomes, respectively, 16 and 19 with Valiant's 2-way UC, and 14 and 16 with the UC from [44].

Table 2. Leading term of the asymptotic O(n log_2 n) sizes of k-way edge-universal graphs (U^(k)_n(Γ_1)) and universal circuits (UC) and the concrete size of their building blocks for k ∈ {2, . . . , 8} according to the design of [46]. B^(k) is the k-way body block with the best existing alternative for universal circuits and permutation networks marked in bold.

Permutation Networks P (k)
Size. Waksman showed in [67] that the lower bound for the size of a permutation network is log_2(k!) for k elements. We show this lower bound in Table 2 as P^(k)_l. The smallest existing permutation network is Waksman's permutation network P^(k)_W [7,67]. For k ∈ {2, 3, 4}, its size matches the lower bound, but for larger values of k, P^(k)_W uses additional nodes.

Depth. The depth of a permutation network has lower bound log_2(k!) + 1, since each input has to have a path to each output, where switches have only two inputs and two outputs. We show these as the depth of P^(k)_l in Table 3. Waksman's permutation network matches the lower bound when k ∈ {2, 3, 4}, but utilizes additional nodes for larger values of k.
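The size comparison can be reproduced numerically. The sketch below rounds the lower bound log_2(k!) up to the next integer and assumes the switch count Σ_{i=1..k} ⌈log_2 i⌉ for an arbitrary-size Waksman network; this closed form is not stated in the text above and is used here only as an assumption for illustration, as are both function names.

```python
import math


def permutation_lower_bound(k: int) -> int:
    """Smallest number s of two-state switches with 2**s >= k!."""
    return (math.factorial(k) - 1).bit_length()


def waksman_size(k: int) -> int:
    """Assumed switch count of an arbitrary-size Waksman network:
    the sum of ceil(log2(i)) for i = 1..k."""
    return sum((i - 1).bit_length() for i in range(1, k + 1))


for k in range(2, 9):
    print(k, permutation_lower_bound(k), waksman_size(k))
```

Under these assumptions, the two values coincide for k ∈ {2, 3, 4} and differ by one switch for k ∈ {5, . . . , 8}, which agrees with the observation above.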

Body Blocks
A body block B^(k) is built of (k − 1) Y-switching blocks, an EUG for k nodes, and two permutation networks P^(k) [46] (cf. Fig. 6 on p. 19). B^(k) shown in Tables 2 and 3 is built using Waksman's permutation network P^(k)_W.

Size. The size of the body block is the sum of the sizes of its building blocks, i.e., size(B^(k)) = min(size(U_k(Γ_1)), size(U^KS08_k)) + 2 · size(P^(k)) + (k − 1) · size(Y).

Depth. The depth of B^(k) is the number of edges in its building blocks, the additional edges between the different blocks and the recursion nodes. This means that in total depth(B^(k)) = min(depth(U_k(Γ_1)), depth(U^KS08_k)) + 2 · depth(P^(k)) + (k − 1) · depth(Y) + 1.

Table 3. Leading terms of the asymptotic O(n) depths of k-way edge-universal graphs (U^(k)_n(Γ_1)) and universal circuits (UC) and the concrete depth of their building blocks for k ∈ {2, . . . , 8} according to the design of [46]. B^(k) is the k-way body block with the best existing alternative for universal circuits and permutation networks marked in bold.
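As a plausibility check of the body block sizes, the following sketch adds up the building blocks for the values of k whose EUG sizes are quoted in the text. It rests on three assumptions that should be read as ours: the body block size is the EUG size plus two permutation networks plus (k − 1) Y nodes, the Waksman size is Σ ⌈log_2 i⌉, and the leading factor of the UC size is 2 · size(B^(k)) / (k log_2 k).

```python
import math

# EUG sizes quoted in the text: U_2 has two poles, the hand-optimized sizes
# for k = 4, 5, 6 are 6, 10, 13, and the construction of [44] gives 21 for k = 8.
EUG_SIZE = {2: 2, 4: 6, 5: 10, 6: 13, 8: 21}


def waksman_size(k: int) -> int:
    """Assumed switch count of a Waksman network: sum of ceil(log2(i))."""
    return sum((i - 1).bit_length() for i in range(1, k + 1))


def body_block_size(k: int) -> int:
    """EUG for k nodes + two permutation networks + (k - 1) Y nodes."""
    return EUG_SIZE[k] + 2 * waksman_size(k) + (k - 1)


def uc_leading_factor(k: int) -> float:
    """Assumed leading factor c of the UC size c * n * log2(n)."""
    return 2 * body_block_size(k) / (k * math.log2(k))


for k in sorted(EUG_SIZE):
    print(k, body_block_size(k), round(uc_leading_factor(k), 3))
```

For k = 2 and k = 4, this yields body blocks with 5 and 19 nodes and leading factors 5 and 4.75, i.e., the known asymptotic sizes of Valiant's 2-way and 4-way UCs.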

Edge-Universal Graphs and Universal Circuits with n Poles
Size. The size of a k-way EUG U^(k)_n(Γ_1) is asymptotically size(B^(k)) / (k log_2 k) · n log_2 n. The leading factor for size(UC) is twice this number, since asymptotically, the number of switches in the UC is the same as the number of nodes in U^(k)_n(Γ_2), which is summarized in Table 2. We use Waksman's permutation network P^(k)_W when calculating the size of the UC, however, even with the lower bound P

Depth. The depths of the EUG and of the UC depend only on the depth of the main skeleton, not on the subgraphs, since the longest path is between p_1 and p_n in the outermost skeleton. Therefore, the asymptotic depths of EUG U  Table 3.

Concrete Size and Depth of UCs
In this section, we consider formulae for the concrete sizes and depths of Valiant's UCs, i.e., UC_{Valiant-2} and UC_{Valiant-4} [66], Zhao et al.'s method UC_{Zhao et al.-4} [72], and our hybrid universal circuits UC_{H(Valiant-2,4)} [31] and UC_{H(Valiant-2, Zhao et al.-4)}. Beforehand, we describe two optimizations.

Optimization for Fanin-1 Nodes
We observe that in U^(k)_n(Γ_1) there is a fanin-1 node in the head block (cf. [31, Fig. 2c and 4e] for UC_{Valiant-2} and UC_{Valiant-4}, respectively). A similarly designed head block for Zhao et al.'s optimized UC_{Zhao et al.-4} [72] has three such fanin-1 nodes (cf. Fig. 19a in "Appendix B"). Moreover, fanin-1 nodes exist in the base cases for a small number of poles as well [45]. These nodes are important to achieve fanin and fanout 2 of the graph, but can be replaced with wires when translated into a circuit description as described in Sect. 3.2. Since at least one such node can be ignored in each subgraph when nodes are translated into gates, this results in at least k · Σ_{i=0}^{log_k n − 1} k^i ∼ kn fewer gates for the universal circuit, where n = u + v + g. We include this optimization in our calculations further on. This improvement decreases the depth of the UC only by a few gates.

Optimization for Input and Output Nodes
In the skeleton of Valiant's UC, the poles corresponding to circuit inputs need no ingoing edges and those corresponding to circuit outputs need no outgoing edges. Therefore, since u, v, and g are publicly known, we optimize by deleting nodes that become redundant when canceling the edges going to the first u (input) nodes and coming from the last v (output) nodes. The exact number of redundant switching nodes depends on the parity or the remainder modulo 4 of u, v, and n = u + v + g, and on the k-way UC, but is O(u + v) in both Γ_1(n) edge-universal graphs that build up the graph of the UC. This optimization also improves the depth by O(u + v).

Concrete Sizes and Depths of 4-way and 2-way UCs
We realize that based on the parity (2-way UC) and the remainder modulo 4 (4-way UC), not only the size of the outermost skeleton, but also that of the smaller subgraphs can be optimized by introducing so-called head and tail blocks (cf. Sect. 3.3 and Sect. 3.4). We considered this in our 2-way UC in [45], and we now generalize the approach for k-way UCs. We provide a recursive formula for the concrete size of the optimized k-way EUG as follows. Let m k be Then, given the designed head, body, and tail blocks (cf. [31, Figs. 2 and 4]) with sizes and depths shown in Table 4, we can compute the size by calculating the sizes of all the components of the outermost skeleton, and the sizes of the smaller subgraphs with the recursive formula in Eq. 14. As described in Sect. 3.1, a UC is constructed by means of an EUG U^(k)_n(Γ_2), which is in turn constructed from two EUGs with fanin and fanout one, U^(k)_n(Γ_1), by merging their poles together and thus taking them only once into consideration. When constructing a UC for circuit C^g_{u,v}, the number of inputs u, the number of outputs v, and the number of gates g with fanin and fanout 2 are public. Thus, using Valiant's construction, U^(k)_n(Γ_2) with n = u + v + g poles is constructed, and thus, our formula for the concrete size of U (k) and the size of the UC is where X, Y, and U denote X-, Y-switching blocks and universal gates (cf. Sect. 3.2), respectively, and size(Y) ≤ size(X) ≤ size(U).
The depth of a k-way UC also depends on m k , the head, tail and body blocks (cf. [31, Figs. 2 and 4]), but not on the subgraphs. Thus, it is calculated using the formula in Eq. 17.

Concrete Size and Depth of Our 2/4 Hybrid UC
In Sect. 5.3, we provide a construction for minimizing the concrete size of the resulting 2/4 hybrid UC. The construction chooses at each step the skeleton that results in the smallest size. We provide the formula for determining its size using a dynamic programming algorithm in Eq. 19. size(H^(k)(i)), size(T^(k)(i)), and size(B^(k)(i)) are values from Table 4 for k = 2 and k = 4. Its depth is the depth of the outermost skeleton, either of the 4-way or 2-way UC, depending on which is chosen first.

Figure 9 shows the concrete improvement in percentage of UC_{Valiant-4} and UC_{Zhao et al.-4} over UC_{Valiant-2} up to ten million nodes in the simulated input circuit. All reported averages are for the interval n ∈ {15, . . . , 10^7}. From the asymptotic leading factors in Table 2, we expect an improvement of up to 5% for UC_{Valiant-4} and up to 10% for UC_{Zhao et al.-4}. In Table 5, we depict the minimum, average, and maximum improvement compared to the asymptotic improvement in the interval n ∈ {2, . . . , 10^7}. For the smallest n values (n ≤ 15), UC_{Valiant-2} is better than both 4-way UCs. However, with growing values of n, the 4-way UCs are better, except for some short intervals as shown in Fig. 9 and summarized in Table 5. For some n values, our hybrid UCs achieve the same size as the 2-way or corresponding 4-way UCs, but due to their nature, their improvement is always nonnegative, and greater than or equal to the improvement achieved by the 4-way UC. Moreover, in most cases our hybrid UCs result in better sizes than the underlying 4-way UC, which means that some subgraphs are created for an n for which the 2-way UC is smaller. The overall improvement over UC

We note that our hybrid UC can also be used to reduce the depth of the UC by utilizing the 2-way UC, UC_{Valiant-2}, in the first step of the construction. This results in the smallest asymptotic depth ∼3n (cf. Table 3).
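The idea of choosing the smaller skeleton at each recursion step can be illustrated with a strongly simplified dynamic program. The sketch below is not Eq. 19: it ignores the head and tail blocks of Table 4, counts a single Γ_1 EUG only, charges 5 nodes per 2-way and 19 nodes per 4-way body block, assumes ⌈n/k⌉ − 1 poles per subgraph, and uses the small EUG sizes quoted in Sect. 6.1 as base cases (the entries for n = 1 and n = 3 are our assumptions).

```python
from functools import lru_cache

# Base cases: small EUG sizes (entries for n = 1 and n = 3 are assumptions).
BASE = {0: 0, 1: 1, 2: 2, 3: 4, 4: 6, 5: 10, 6: 13}
# Simplified skeleton cost: nodes per k-way body block, poles included.
BODY = {2: 5, 4: 19}


def blocks(n: int, k: int) -> int:
    """Number of k-way blocks in a skeleton with n poles, i.e. ceil(n / k)."""
    return -(-n // k)


@lru_cache(maxsize=None)
def pure(n: int, k: int) -> int:
    """Size when the same k-way skeleton is used at every recursion step."""
    if n in BASE:
        return BASE[n]
    return BODY[k] * blocks(n, k) + k * pure(blocks(n, k) - 1, k)


@lru_cache(maxsize=None)
def hybrid(n: int) -> int:
    """Size when each step picks the cheaper of the 2-way and 4-way skeleton."""
    if n in BASE:
        return BASE[n]
    return min(BODY[k] * blocks(n, k) + k * hybrid(blocks(n, k) - 1)
               for k in (2, 4))


for n in (50, 1000, 100000):
    print(n, pure(n, 2), pure(n, 4), hybrid(n))
```

By construction, the hybrid value never exceeds either pure variant, which mirrors the statement that the improvement of the hybrid UC is always nonnegative. The concrete numbers printed by this sketch are not the sizes reported in Table 5.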

Implementation and Evaluation of Our UC Compiler
In this section, we detail the challenges faced while demonstrating the practicality of Valiant's and Zhao et al.'s universal circuits. We show how to construct a UC and program it according to a standard circuit description. We validate our results with a practical implementation that, upon receiving a fanin-2 circuit C^g̃_{u,v} as input, outputs the corresponding 2-way or 4-way UC, i.e., UC_{Valiant-2}, UC_{Valiant-4}, or UC_{Zhao et al.-4}, and its programming c_f. We provided the first implementation of Valiant's 2-way UC of size ∼5n log_2 n in [45] and implemented Valiant's 4-way UC of smaller size ∼4.75n log_2 n in a modular way in [31].
In this work, we extend our implementation with the modular 2-way UC and include the optimized 4-way UC of Zhao et al. [72] with size ∼4.5n log_2 n. We then combine the modular 2-way UC with both 4-way UCs in an implementation of our hybrid UC proposed in [31] and Sect. 5.3, i.e., UC_{H(Valiant-2,4)} and UC_{H(Valiant-2, Zhao et al.-4)}, respectively. Moreover, we provide a prototype implementation of our scalable 4-way UC from Sect. 5.4, which can be generalized to both the 2-way UC and Zhao et al.'s improvement.

UC Compiler
The architecture of our UC compiler is depicted in Fig. 10. In this section, we briefly describe its different artifacts and its use of the Fairplay [51] or CBMC-GC [10,23] frameworks as a frontend. For a more detailed description, the reader is referred to [45]. Our implementation is available online at https://encrypto.de/code/UC.

1. Compiling Input Circuits from High-Level Functionality. We can use the Fairplay compiler [11,51] with the FairplayPF extension [44] or the CBMC-GC compiler [10,23] to translate the functionality described in a high-level language to the Fairplay circuit description called Secure Hardware Definition Language (SHDL). These compilers output a circuit C^g̃_{u,v} with fanin 2, which is required for all UCs. However, due to Valiant's design, the input circuit C^g_{u,v} to our UC compiler has to have fanout 2 as well, i.e., the output of each gate and each input can be used as the input of at most two subsequent gates. This can be achieved using copy gates such that instead of g̃ gates, we have g̃ ≤ g ≤ 2g̃ + v fanout-2 gates (cf. Sect. 2.2). We give concrete examples in [45] on how this conversion affects the size of practical circuits and show that in most cases, the resulting number of gates remains significantly below the upper bound 2g̃ + v.

2. Obtaining the Γ_2(n) Graph G of the Circuit C^g_{u,v}. As the next step, we transform circuit C^g_{u,v} into a Γ_2(n) graph G = (V, E) with n = u + v + g (cf. Sect. 3.1). This can directly be generated as described in Sect. 2.2: With the number of inputs u, outputs v, and gates g in circuit C^g_{u,v}, G has n nodes and the wires are represented as edges in the graph. Then, we define a topological order η_G on the nodes of G such that every input node v_i has a topological order of 1 ≤ η_G(v_i) ≤ u and every output node v_j is labeled with u + g < η_G(v_j) ≤ n. Since C^g_{u,v} has fanin and fanout 2, the resulting graph G is in Γ_2(n), where n = u + v + g.
It is possible in the modified SHDL circuit description that an internal value is used twice as the first input or twice as the second input of gates. Therefore, when a value is used for the second time in the same input position of a gate (i.e., first or second), both the two inputs and the two middle bits of the function table of that gate must be swapped (i.e., to compute f(in_1, in_2) instead of f(in_2, in_1)) for the correct programming of the UC in Step 5.

3. The two U_n(Γ_1) instances get merged to U_n(Γ_2) (cf. Sect. 3.1) so that one builds the left inputs and outputs and the other builds the right inputs and outputs of the gates (based on the two-coloring of G). For efficiency reasons, we directly generate the merged edge-universal graph, i.e., an EUG for Γ_2(n), with the poles as common nodes. We partly include our optimization for the input and output nodes from Sect. 6.2.2 (we delete edges coming into inputs and going out from outputs; as a result, some nodes are removed by our fanin-1 optimization from Sect. 6.2.1 when translated into a UC) and Valiant's optimizations for the base cases n ∈ {2, 3, 4}, but do not consider Valiant's optimizations for n ∈ {5, 6} [66]. Knowing the number of input bits u, the number of gates g, and the number of output bits v, we construct the corresponding edge-universal graph U_n(Γ_2), where n = u + v + g. We note that no knowledge is necessary about the topology or the gate tables in circuit C for this step.

4. Programming U_n(Γ_2) and U^hybrid(K)_n(Γ_2) According to an Arbitrary Γ_2(n) Graph. We edge-embed graph G into U_n(Γ_2) as described in Sect. 4 and into our hybrid U^hybrid(K)_n(Γ_2) with K = {2, 4} as described in Sect. 5.3. G is partitioned into two Γ_1(n) graphs G_1 and G_2 which are embedded into the two EUGs U_n(Γ_1)_1 and U_n(Γ_1)_2. Valiant proved in [66] that any topologically ordered Γ_1(n) graph can be edge-embedded in an EUG U_n(Γ_1) (cf. Sect. 3.1).
We perform the embedding as described in Sect. 4 for Valiant's 2-way and 4-way EUGs in Listing 1. The difference when using Zhao et al.'s improvement [72] is the block edge-embedding described in Sect. 4.1. Here, we utilize a lookup table derived from the computer-generated proof of Zhao et al. [72] that maps the in and out vectors as defined in Sect. 4.1 into the programming bits of the block, i.e., it can be used as block edge-embedding along with the recursion point edge-embedding described in Sect. 4.2. We edge-embed G_1 and G_2 into our 2/4-hybrid EUGs U^hybrid(K)_n(Γ_1)_1 and U^hybrid(K)_n(Γ_1)_2 as described in Sect. 5.3. When the edge-embedding is finished, we define the control bits of the programmable blocks (universal gates and switches) as described in Sect. 3.2.
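The function table adjustment for a gate whose inputs are exchanged can be made concrete as follows. In this sketch, a table is a tuple (c_0, c_1, c_2, c_3) with entry 2·in_1 + in_2 holding f(in_1, in_2); this indexing and the helper name `swap_gate_inputs` are our assumptions for illustration.

```python
def swap_gate_inputs(table):
    """Return the table of g(a, b) = f(b, a), where table[2 * a + b] = f(a, b).

    Exchanging the inputs leaves f(0, 0) and f(1, 1) in place and swaps the
    two middle bits f(0, 1) and f(1, 0).
    """
    c0, c1, c2, c3 = table
    return (c0, c2, c1, c3)


# f(a, b) = (not a) and b is not symmetric, so its table changes.
print(swap_gate_inputs((0, 1, 0, 0)))  # (0, 0, 1, 0)
```

Symmetric gates such as AND or XOR have equal middle bits and are therefore unaffected by the swap.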

5. Generating the Output Circuit Description and the Programming of the Universal Circuit. After embedding the graph of the simulated circuit into the edge-universal graph U_n(Γ_2), we write the resulting circuit in a file using our generic UC description. In the edge-universal graph, each node stores the control bit resulting from the edge-embedding (control bit c of the corresponding universal switch in Sect. 3.2) and each pole corresponding to a gate stores four bits (the four control bits of the function table of the corresponding gate in the original circuit C^g_{u,v}, i.e., c_0, c_1, c_2, c_3 in Eq. 1, their order possibly changed in Step 2). Thus, after topologically ordering U_n(Γ_2), one can directly write out the gate identifiers into a circuit file UC and the control bits to a programming file c_f. We include our optimization from Sect. 6.2.1 and ignore extra nodes with fanin 1 when the graph is translated into a UC description. This improves the size of the recursion bases for n ∈ {4, 5, 6} as well as of the head blocks [31, Fig. 2c and Fig. 4e] and Fig. 19a in "Appendix B."

Our circuit description format is generic, i.e., it consists of universal switches and universal gates. Therefore, any framework can be adapted to use them, regardless of whether the result is interpreted as a Boolean or an arithmetic UC. We start with enumerating the client input wires as C 0 1 . . . u − 1. As a reminder, the O(n log n) server input wires are in the programming file c_f. In the UC, we have universal gates denoted by U and universal switches denoted by X or Y depending on the number of outputs (X with two outputs and Y with one):

X in_1 in_2 out_1 out_2 (21)
Y in_1 in_2 out_1 (22)

denotes that wire out_1 (and possibly out_2) is coming from a gate with input wires in_1 and in_2

The circuit and its programming are given in plain text files as shown in Listings 3 and 4 in "Appendix C."
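To illustrate how such a description can be consumed, the following sketch evaluates a toy UC given as text lines. Only the C, X, and Y line shapes are taken from the description above; the U and O line shapes, the polarity of the control bits (c = 1 swaps or selects the second input), the table indexing 2·in_1 + in_2, and the representation of the programming as a Python list are hypothetical choices made for this example and need not match the actual file format.

```python
def evaluate_uc(lines, inputs, program):
    """Evaluate a toy UC description.

    `program` holds one entry per X/Y/U line in file order: a control bit
    for a switch, or a 4-tuple function table for a universal gate.
    """
    wires = {}
    bits = iter(program)
    for line in lines:
        if not line.strip():
            continue
        kind, *fields = line.split()
        ids = [int(f) for f in fields]
        if kind == "C":                       # client input wires
            wires.update(zip(ids, inputs))
        elif kind == "X":                     # two outputs, swapped if c = 1
            in1, in2, out1, out2 = ids
            a, b = wires[in1], wires[in2]
            wires[out1], wires[out2] = (b, a) if next(bits) else (a, b)
        elif kind == "Y":                     # one output, selects an input
            in1, in2, out1 = ids
            wires[out1] = wires[in2] if next(bits) else wires[in1]
        elif kind == "U":                     # hypothetical gate line
            in1, in2, out1 = ids
            table = next(bits)
            wires[out1] = table[2 * wires[in1] + wires[in2]]
        elif kind == "O":                     # hypothetical output line
            return [wires[i] for i in ids]
        else:
            raise ValueError(f"unknown line type {kind!r}")
    raise ValueError("no output line found")


toy_uc = ["C 0 1", "X 0 1 2 3", "U 2 3 4", "O 4"]
print(evaluate_uc(toy_uc, [1, 0], [1, (0, 1, 0, 0)]))  # [1]
```

In the toy circuit, the switch exchanges the two inputs before the universal gate is applied, so the gate programmed as (not in_1) and in_2 outputs 1 on the client input (1, 0).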

Experimental Evaluation
We ran all experiments for our UC compiler on a Desktop PC, equipped with an Intel Core i7-4790 CPU with 3.6 GHz and 32 GB RAM, and provide our results in this section.
We performed experiments for circuit sizes n ∈ {10, 100, . . . , 1 000 000} as well as with notable circuits from [65] such as the AES-128 circuit without key expansion with size n = 38 518 and the SHA-256 circuit with size n = 201 206. We note that these sizes are for the circuits transformed to have fanin and fanout 2 as described in Sect. 2.2 and in [45, Table 1].

Circuit Sizes (Fig. 11). We first compare the circuit sizes of our implementations, which slightly differ from the expected sizes shown in Sect. 6. Our initial 2-way UC_{Valiant-2} implementation from [45] included the recursion bases for 1, 2, and 3 nodes, but did not include those proposed by Valiant [66] optimized for 4, 5, and 6 nodes. It included both size optimizations described in Sects. 6.2.1 and 6.2.2. In Fig. 11, we show the improvement over our UC_{Valiant-2} implementation from [45] in percentage of the number of switches of our later, more modular UC implementations presented in this article and in [31]. We note that the number of universal gates is the same for all implementations, i.e., the number of gates g in the original circuits.
Our modular 4-way UC_{Valiant-4} implementation from [31] additionally included the recursion base with 4 nodes, but only partly included the optimization described in Sect. 6.2.2 concerning the input and output nodes. The edges directed into the inputs and out of the outputs are removed, which results in smaller sizes because some nodes become redundant; however, not all unnecessary connections are deleted. This incurs only a small overhead of at most O(u + v). As we can observe in Fig. 11 and as expected (cf. Table 5 on p. 32), this implementation improved by around 5% over our implementation from [45].
In this article, we have first implemented the modular version of Valiant's 2-way UC_{Valiant-2}, where we inherently use the optimized recursion base with 4 nodes as well. An improvement of around 1.5-2% can be observed over our non-modular implementation from [45]. Using this and our modular 4-way UC_{Valiant-4}, we have implemented our hybrid UC_{H(Valiant-2,4)} using Valiant's 2-way and 4-way UCs as proposed in [31]. This implementation has a steadier improvement of at least 5% for most tested circuit sizes. Moreover, we also implemented the optimized UC_{Zhao et al.-4} proposed by Zhao et al. [72], who proved that their optimized block is universal by giving the programming for all possible path combinations in the block. We use this proof to generate a lookup table file for our implementation, which contains a mapping from any possible input-output vector (cf. Sect. 4.1) to the corresponding programming bits for the block. The generation of this lookup table is a one-time precomputation cost and takes around 82 seconds. In subsequent runs of the UC compiler, this overhead is no longer needed and a file of size 1.08 MB is read, which takes only about 80 milliseconds. Thereafter, the expected gain of around 10% can be observed over our 2-way UC_{Valiant-2} implementation from [45]. Moreover, the hybrid variant with this construction, i.e., UC_{H(Valiant-2, Zhao et al.-4)}, achieves an improvement of at least 10% for all our example circuits.
In Table 6, we show the concrete number of switches of the smallest UCs generated with UC_{H(Valiant-2, Zhao et al.-4)} as well as the sizes of the resulting UC and programming files. The universal circuit for n = 1 million gates has around 76 million switches and additionally around 1 million universal gates (which, in the PFE setting, results in a total of about 77 million AND gates for Yao's garbled circuit protocol and 79 million AND gates for the GMW protocol). The corresponding file for the UC has size 2.8 GB, and the programming file has size 0.15 GB.

Runtime (Fig. 12). To compare the runtime of our UC implementation with that of the UC compiler of [45], we ran the same experiments on the same platform using our novel implementations, including the hybrids UC_{H(Valiant-2,4)} and UC_{H(Valiant-2, Zhao et al.-4)}. Runtimes are reported as averages from 10 executions. The differences in runtimes for the different constructions are not significant, and therefore, we only depict the runtimes of our hybrid implementations UC_{H(Valiant-2,4)} and UC_{H(Valiant-2, Zhao et al.-4)} in Fig. 12.
The runtimes of our modular UC_{Valiant-2} and UC_{Valiant-4} implementations are very similar to those of UC_{H(Valiant-2,4)}, the latter of which becomes best for larger circuits.

Scalable UC Generation (Figs. 13, 14). We also implemented our scalable 4-way UC generation algorithm presented in Sect. 5.4. We note that our implementation only includes H^i, T^i_x, and B^i_x for i = 0, 1, 2, 3 and x = 4 and does not include the optimized versions for x = 1, 2, 3, which we leave as future work. Moreover, we include the base cases for n = 1, 2, 3 but not that for n = 4. The reason is that including the other options would require a lot of engineering effort, and our work is only a proof-of-concept implementation of the method presented in Sect. 5.4. Therefore, we test circuits with specific sizes where none of the other blocks or base cases are required, i.e., where all subgraphs at each recursion step have 4 nodes in the tail block and the base case with n = 4 is not needed. Currently, for generating UCs for different sizes, one would need to pad the original circuit with dummy gates to an allowed size. Our aim was to improve the memory consumption of the UC generation (and programming) algorithm, while keeping the price paid in runtime as low as possible. The number of files created is the number of subgraphs in the UC, which is necessary for efficient scalable programming of the UC.
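The restriction to specific circuit sizes can be expressed as a small check. The sketch below encodes our reading of the condition above: at every recursion step the number of poles is a multiple of 4 and larger than 4, each subgraph then has (n − 4)/4 poles, and the recursion must end in a base case with 1, 2, or 3 nodes. Both function names are hypothetical and not part of the prototype.

```python
def is_supported_size(n: int) -> bool:
    """True if every recursion step has a 4-node tail block and the
    recursion ends in a base case with 1, 2, or 3 nodes."""
    while n > 3:
        if n == 4 or n % 4 != 0:
            return False  # would need the n = 4 base case or a smaller tail block
        n = (n - 4) // 4
    return n >= 1


def pad_to_supported_size(n: int) -> int:
    """Smallest supported size >= n, i.e. the size after adding dummy gates."""
    while not is_supported_size(n):
        n += 1
    return n


print([m for m in range(1, 70) if is_supported_size(m)])
print(pad_to_supported_size(20))  # 36
```

Under this reading, a circuit with n = 20 would have to be padded to n = 36, since 20, 24, 28, and 32 all lead to a subgraph that needs an unsupported block or base case.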
We show that our scalable UC generation implementation provides the expected improvement in memory usage by comparing our scalable UC_{Valiant-4} implementation to our implementation from [31]. We depict in Fig. 13 the memory usage of the generation algorithm with growing input circuit sizes on a machine with 32 GB of RAM. As can be seen in the figure, instead of holding the whole UC of size O(n log n) in memory, we indeed hold only O(n) information in memory at each step. When using 1 GB, 8 GB, and 32 GB of memory, we can generate a UC for over 27×, 28×, and 29× larger input circuit sizes n, respectively. Moreover, as can be observed in Fig. 14, the runtime of the resulting scalable UC generation is only around 4× that of the UC_{Valiant-4} implementation of [31]. This difference becomes smaller with increasing n because the implementation of [31] runs short on memory and starts swapping to disk. Our experiments show that while reducing the memory requirements of our UC generation for UC_{Valiant-4}, we keep the runtime asymptotically the same (cf. Fig. 14). Moreover, the required storage capacity is also O(n log n) as before, since the additionally stored data at each step are at most O(n), cf. Sect. 5.4.

Toolchain for Private Function Evaluation
Secure function evaluation (SFE) allows two parties to jointly compute a public function on their private inputs, without revealing anything to each other apart from the output of the computation. As it is probably the most prominent application of UCs (cf. Sect. 1.1), we implement private function evaluation (PFE) using SFE of a Boolean universal circuit. In this scenario, one of the parties holds its input x and the other party holds the programming c f corresponding to a private function f that allows the UC to compute UC(x, c f ) = f (x). We note that the UC (with control bits for the universal gates and switches) can be publicly generated.
We have created a novel toolchain for private function evaluation (PFE) in [45], using the ABY framework for SFE (secure against semi-honest adversaries) as backend of our UC compiler. ABY implements state-of-the-art optimizations of Yao's garbled circuit protocol [69,70] and the GMW protocol [32]. We emphasize that our tool for constructing and programming UC is generic and can easily be adapted to other secure computation frameworks or other applications of UCs listed in Sect. 1.1.

Extension of the ABY Framework
We adapt the ABY secure two-party computation framework [19] for securely evaluating universal circuits. We realize the universal circuit building blocks (universal gates and switches) with a number of AND and XOR gates, which is the functionally complete set of logical gates that ABY uses. Since XOR gates can be evaluated for free in the underlying protocols for secure function evaluation due to the free-XOR optimization [43], from here on, we study the AND-size (size_AND) and AND-depth (depth_AND) of UCs, i.e., the number of AND gates and the maximum number of AND gates on the longest path, respectively. For other applications, however, the total sizes and depths of the UCs with respect to both AND and XOR gates are relevant. We implement universal gates and switches that are optimized for PFE and therefore use few AND gates alongside (free) XOR gates. X and Y gates are obtained as shown in [43] with size_AND(Y) = size_AND(X) = depth_AND(Y) = depth_AND(X) = 1 for both universal switches. In case the SFE implementation uses Yao's garbled circuit protocol [70], both size_AND(U) = 1 and depth_AND(U) = 1, since in some garbling schemes (such as garbled 3-row reduction (GRR3) [55]) the evaluator does not learn the type of the evaluated gate. Therefore, a universal gate can be implemented using only one 2-input non-XOR gate [60]. For other SFE protocols such as GMW where this optimization is not possible, our efficient implementation of generic universal gates uses Y gates, yielding size_AND(U) = 3 and depth_AND(U) = 2. We note that the implementation of switches and universal gates might look very different when other 2-input Boolean gates can also be used, e.g., when other size metrics are to be minimized. We include our implementation of these efficient UC building blocks in the open-source ABY framework https://encrypto.de/code/ABY.
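The AND counts stated above can be checked on plain bits. The following sketch shows one possible realization with a single AND gate per switch and a universal gate built from three Y gates; it is meant as an illustration of these counts on cleartext values, not as the code used in ABY, and the control bit polarity as well as the function names are our assumptions. The last helper reproduces the AND totals reported for the UC with one million gates.

```python
from itertools import product


def y_switch(a, b, c):
    """Select a if c = 0 and b if c = 1, using one AND gate."""
    return a ^ (c & (a ^ b))


def x_switch(a, b, c):
    """Pass (a, b) through if c = 0, swap them if c = 1, using one AND gate."""
    t = c & (a ^ b)
    return a ^ t, b ^ t


def universal_gate(a, b, table):
    """Compute table[2 * a + b] with three Y gates (3 ANDs, AND-depth 2)."""
    c0, c1, c2, c3 = table
    low = y_switch(c0, c1, b)    # f(0, b)
    high = y_switch(c2, c3, b)   # f(1, b)
    return y_switch(low, high, a)


def and_size(switches, gates, protocol):
    """AND gates of a UC: 1 per switch, and 1 (Yao) or 3 (GMW) per universal gate."""
    return switches + gates * {"yao": 1, "gmw": 3}[protocol]


for table in product((0, 1), repeat=4):
    for a, b in product((0, 1), repeat=2):
        assert universal_gate(a, b, table) == table[2 * a + b]

print(and_size(76_000_000, 1_000_000, "yao"))  # 77000000
print(and_size(76_000_000, 1_000_000, "gmw"))  # 79000000
```

The first layer of Y gates is controlled by the gate input in_2 and selects between programming bits, which is why the two ANDs of that layer can be evaluated in parallel.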
For evaluating a UC securely, the output universal circuit file of our UC compiler is parsed, a circuit UC is generated and evaluated with the input x and the control bits c f to compute f (x). Our toolchain is the first implementation of Valiant's size-optimized UC that supports efficient private function evaluation [45].

Experimental Results
We validate the practicality of our implementation, which is the first practical implementation of private function evaluation (PFE), cf. Sect. 1.1. We ran our experiments on two Desktop PCs, each equipped with an Intel Core i9-7960X CPU with 2.8 GHz and 128 GB RAM. We give the runtimes in Fig. 15 and communication in Fig. 16 for our example circuits from the previous section, i.e., for random circuits of sizes n ∈ {10, 100, . . . , 1,000,000} as well as the AES and SHA-256 circuits from [65]. For completeness, we give the exact numbers in Table 7 in "Appendix D." Our runtime measurements are provided from an average of 10 executions, in two different settings: in a LAN setting with 10 Gbit/s bandwidth and 1 ms RTT, as well as in a simulated WAN setting with 100 Mbit/s bandwidth and 100 ms RTT.
We evaluate UCs in ABY [19] with both the GMW protocol [32] and Yao's garbled circuit protocol [69] with state-of-the-art optimizations. Yao's garbled circuit protocol achieves much better runtimes than the GMW protocol since the latter has O(n) rounds (i.e., the number of rounds is the depth of the circuit, and Valiant's UCs have depth O(n), cf. Sect. 6.1 and Table 7 in "Appendix D"), whereas Yao's protocol runs in 3 rounds. The effect of this is especially apparent in the WAN setting where the round-trip time is much higher. In both settings, the runtime of the GMW protocol is dominated by the linear term due to the linear number of online rounds. The amount of communication is similar in both implementations; however, it could be reduced by half for Yao's protocol if X and Y switches were implemented with the optimization from [43] using only one ciphertext. The current implementation utilizes two ciphertexts per X or Y switch.
Due to the clear advantage of Yao's protocol over the GMW protocol, we highly recommend using Yao's protocol when evaluating UCs securely for PFE. Investigating depth-optimized UCs [17], whose depth is O(d) in the depth d of the input circuit, could improve the performance of the GMW protocol; however, its number of rounds would still depend on d, whereas Yao's protocol runs in only 3 rounds.
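The effect of the round count can be illustrated with a back-of-the-envelope estimate. The sketch below only multiplies the number of rounds by the round-trip time, ignoring computation and bandwidth, and assumes one round trip per round. The RTT values are those of the LAN and WAN settings above; the depth value and all function names are hypothetical and serve illustration only.

```python
# Rough latency estimate: time spent purely on round trips.
# Assumption: one network round trip per protocol round.

RTT_MS = {"LAN": 1.0, "WAN": 100.0}  # round-trip times of the two settings
YAO_ROUNDS = 3                        # constant number of rounds for Yao's protocol
EXAMPLE_DEPTH = 10_000                # hypothetical UC depth, for illustration only


def round_latency_seconds(rounds: int, rtt_ms: float) -> float:
    """Seconds spent waiting on round trips for the given number of rounds."""
    return rounds * rtt_ms / 1000.0


def compare(depth: int = EXAMPLE_DEPTH) -> dict:
    """Latency estimates per setting; GMW is assumed to need `depth` rounds."""
    return {
        setting: {
            "GMW": round_latency_seconds(depth, rtt),
            "Yao": round_latency_seconds(YAO_ROUNDS, rtt),
        }
        for setting, rtt in RTT_MS.items()
    }


if __name__ == "__main__":
    for setting, estimate in compare().items():
        print(f"{setting}: GMW ~{estimate['GMW']:.1f} s, Yao ~{estimate['Yao']:.3f} s")
```

Under these assumptions, the GMW estimate grows linearly with the depth and with the RTT, while the estimate for Yao's protocol depends only on the RTT.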

Comparison of PFE Approaches
Mohassel et al. [53] design a generic framework for PFE and apply it to three different scenarios: the m-party GMW protocol [32], Yao's garbled circuits [70], and arithmetic circuits using homomorphic encryption [16]. The two-party versions of their framework, with the GMW protocol and with Yao's garbled circuit protocol, each have two alternatives: using homomorphic encryption, they achieve linear complexity O(n) in the circuit size n, and using a solution based solely on oblivious transfers (OTs), they obtain a construction with O(n log n) symmetric-key operations. In both cases, the OT-based construction is more desirable in practice, since OT extension significantly reduces the number of expensive public-key operations [2,36].
As the asymptotic complexity of this construction and that of PFE with Valiant's UC are the same, we compare these methods. We revisit the formulas provided in [53] for the PFE protocol based on Yao's garbled circuits and elaborate on the number of symmetric-key operations required by the different PFE protocols. Mohassel et al. show that their framework has a total of $4\tilde{g}\log_2(2\tilde{g}) + 1$ switches that are evaluated using OT extension, for which they calculate $8\tilde{g}\log_2(2\tilde{g}) + 8$ symmetric-key operations, together with $5\tilde{g}$ operations for evaluating the universal gates with Yao's protocol. We count only the work of the party that performs most of the work, i.e., $4\tilde{g}$ symmetric-key operations for creating a garbled circuit with $\tilde{g}$ gates and 3 symmetric-key operations (two calls to a hash function and one call to a pseudorandom function (PRF)) for each OT using today's most efficient OT extension of [2]. Hence, according to our estimations, the protocol of [53] requires $12\tilde{g}\log_2(2\tilde{g}) + 4\tilde{g} + 12$ symmetric-key operations.
In the same way, we assume that in the case of PFE with UCs, for both the universal gates and switches, the garbler needs 4n symmetric-key operations. In this case, however, $n = u + v + g$, where $\tilde{g} \le g \le 2\tilde{g} + v$. It is therefore difficult to directly compare the complexities of specifically designed protocols with $\tilde{g}$ fanin-2 gates and of UCs, where the input circuit is required to have fanout 2 as well. In Fig. 17, we therefore depict the minimum and maximum required number of symmetric-key operations for circuits of size $\tilde{g} \in \{10, 100, \ldots, 1{,}000{,}000\}$. Moreover, we depict the concrete values for real-world circuits (AES-128 and SHA-256 from [65]) for SFE with UCs, and note that for the other approaches the points lie on the corresponding line.
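The quantities used in this comparison can be evaluated numerically. The following minimal sketch computes the estimated number of symmetric-key operations for the protocol of [53] and the range of the simulated circuit size n that a UC has to support, both as functions of the number of fanin-2 gates of the input circuit. The function and parameter names are hypothetical; only the formulas are taken from the text.

```python
import math


def mohassel_symmetric_ops(g_fanin2: int) -> float:
    """Estimate for the protocol of [53]: 12*g*log2(2g) + 4g + 12,
    where g is the number of fanin-2 gates of the input circuit."""
    return 12 * g_fanin2 * math.log2(2 * g_fanin2) + 4 * g_fanin2 + 12


def simulated_size_bounds(u: int, v: int, g_fanin2: int) -> tuple:
    """Minimum and maximum of n = u + v + g for a circuit with u inputs and
    v outputs, where the fanout-2 gate count g lies between g_fanin2 and
    2*g_fanin2 + v."""
    g_min = g_fanin2
    g_max = 2 * g_fanin2 + v
    return u + v + g_min, u + v + g_max


if __name__ == "__main__":
    # Hypothetical example circuit: 128 inputs, 128 outputs, 1000 fanin-2 gates.
    print(mohassel_symmetric_ops(1000))
    print(simulated_size_bounds(128, 128, 1000))
```

The gap between the two bounds on n is what makes a single direct comparison difficult and motivates plotting both the minimum and the maximum.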
The protocol of [53] has been improved to achieve better communication in [6]. The communication of the protocol of [53] is $(10\tilde{g}\log 2\tilde{g} + 4\tilde{g} + 5) \cdot 128$, while that of [6] is $(6\tilde{g}\log 2\tilde{g} + 0.5\tilde{g} + 3) \cdot 128$. For SFE with UCs, we require one ciphertext per X and Y switch [43] and $3 \cdot 2$ ciphertexts per universal gate. Figure 18 depicts the comparison between the communication of SFE with UCs, with minimum and maximum values depending on the relation of $g$ and $\tilde{g}$ as before, and the alternatives of [53] and [6]. We can see that SFE with UCs always achieves the best communication, requiring 1.5-3× less communication than the improvement of [6].
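The two communication formulas for [53] and [6] can be compared with a few lines of code. This is a sketch under explicit assumptions: the logarithm is read as the base-2 logarithm of 2g, and the factor 128 is kept as a plain multiplier without committing to a unit. The function names are hypothetical.

```python
import math


def communication_53(g: int) -> float:
    """(10*g*log(2g) + 4g + 5) * 128, with log assumed to be log2 of 2g."""
    return (10 * g * math.log2(2 * g) + 4 * g + 5) * 128


def communication_6(g: int) -> float:
    """(6*g*log(2g) + 0.5g + 3) * 128, with log assumed to be log2 of 2g."""
    return (6 * g * math.log2(2 * g) + 0.5 * g + 3) * 128


def improvement_factor(g: int) -> float:
    """How many times less communication [6] needs compared with [53]."""
    return communication_53(g) / communication_6(g)


if __name__ == "__main__":
    for g in (10, 1_000, 1_000_000):
        print(g, round(improvement_factor(g), 3))
```

Under this reading, the factor approaches the ratio 10/6 of the leading coefficients as g grows.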

Conclusion
Universal circuits (UCs) are highly relevant for various applications such as verifiable computation, attribute-based encryption, and private function evaluation (PFE), which can, for example, be used for the privacy-preserving evaluation of diagnostic programs and proprietary software and in private database management systems. These applications require size-optimized universal circuits, first proposed by Valiant [66]. Since then, several optimizations have appeared that further reduce the size of UCs.
In this article, we revisit Valiant's original constructions and the optimizations later proposed in our previous works by Kiss and Schneider [45] and Günther et al. [31], as well as by Zhao et al. [72]. We have shown the practicality of Valiant's universal circuit constructions and their improvements by providing an implementation of the most efficient UC to date, with size $\sim 4.5n\log_2 n$ in the input circuit size n. Moreover, we substantially reduce the memory consumption of our UC generation algorithm by designing and implementing a method that uses O(n) memory instead of the O(n log n) memory used by previous methods.
Universal circuits for an input circuit size of one million can be generated and programmed within around 18 minutes on a standard PC and utilized in various applications. We demonstrate the practicality of PFE with the secure evaluation of UCs and show that such a large universal circuit can be evaluated within 1.3 and 5.9 minutes using Yao's garbled circuit protocol in the LAN and WAN settings, respectively.
[...] Deutsche Forschungsgemeinschaft (DFG) - SFB 1119 CROSSING/236615297 and GRK 2050 Privacy & Trust/251805230, and by the BMBF and HMWK within CRISP and ATHENE. We thank Michael Zohner for helping with the implementation in ABY and the anonymous reviewers of EUROCRYPT'16, ASIACRYPT'17, and JoC for their helpful comments.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

$C^{\hat{g}}_{u,v}$
The Boolean circuit that describes f with arbitrary fanin and fanout.

$C^{\tilde{g}}_{u,v}$
The Boolean circuit that describes f with fanin 2 and arbitrary fanout.

$C^{g}_{u,v}$
The Boolean circuit that describes f with fanin and fanout 2.

$n$
Size of the simulated circuit $C^{g}_{u,v}$ with fanin and fanout 2, $n = u + v + g$.

$d$
Depth of the simulated circuit $C^{g}_{u,v}$.

$G$
The $\Gamma_2(n)$ graph of $C^{g}_{u,v}$ where every input, output and gate is represented with a node and every wire is represented with an edge.

$\Gamma_\rho(n)$
The set of all graphs with fanin and fanout $\rho$ and $n$ nodes.

$U_n(\Gamma_\rho)$
Edge-universal graph for $\Gamma_\rho(n)$ graphs, used generally for Valiant's UC.

$U^{(k)}_n(\Gamma_\rho)$
k-way edge-universal graph for $\Gamma_\rho(n)$ graphs.

Set of all poles in $U_n(\Gamma_\rho)$.

U
A universal gate that computes any function with two inputs and one output, using four control bits $c_0, c_1, c_2, c_3$ as in Eq. 1.

X
A two-output X-switching block that returns its two input values either in the same or in reversed order depending on control bit c.

Y
A one-output Y-switching block that returns one of the two input values depending on control bit c.
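The three programmable building blocks listed above can be modeled in a few lines. The sketch below is an illustration under assumptions, not the exact definition of Eq. 1: the universal gate is modeled as a truth-table lookup in which control bit $c_{2a+b}$ is the output for inputs (a, b), and for both switches the control bit 0 is assumed to select the first input or the unswapped order. All function names are hypothetical.

```python
def universal_gate(a: int, b: int, c0: int, c1: int, c2: int, c3: int) -> int:
    """Two-input gate programmed by four control bits.
    Assumed indexing: control bit c_{2a+b} is the output for inputs (a, b)."""
    truth_table = (c0, c1, c2, c3)
    return truth_table[2 * a + b]


def x_switch(a: int, b: int, c: int) -> tuple:
    """Two-output switch: same order if c == 0, reversed order if c == 1
    (which value of c swaps is an assumption)."""
    return (b, a) if c else (a, b)


def y_switch(a: int, b: int, c: int) -> int:
    """One-output switch: forwards a if c == 0, b if c == 1
    (which value of c selects which input is an assumption)."""
    return b if c else a


if __name__ == "__main__":
    # Program the universal gate as AND, then as XOR, and print the truth tables.
    AND_BITS = (0, 0, 0, 1)
    XOR_BITS = (0, 1, 1, 0)
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, universal_gate(a, b, *AND_BITS), universal_gate(a, b, *XOR_BITS))
```

With four control bits, the universal gate covers all 16 two-input Boolean functions, whereas each switch needs only a single control bit.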

D Concrete Performance Measures for Private Function Evaluation
In this section, we provide the concrete performance measures used for depicting the runtimes and communication of PFE by securely evaluating UCs generated with UC H (
* Denotes cases where an experiment would have taken more than 5 hours and therefore was not performed.