Efficient Binary Decision Diagram Manipulation in External Memory

We follow up on the idea of Lars Arge to rephrase the Reduce and Apply procedures of Binary Decision Diagrams (BDDs) as iterative I/O-efficient algorithms. We identify multiple avenues to simplify and improve the performance of his proposed algorithms. Furthermore, we extend the technique to other common BDD operations, many of which are not derivable using Apply operations alone, and we provide asymptotic improvements for the procedures that can be derived using Apply. These algorithms are implemented in a new BDD package, named Adiar. We see very promising results when comparing the performance of Adiar with conventional BDD packages that use recursive depth-first algorithms. For instances larger than 8.2 GiB, our algorithms, in parts using the disk, are 1.47 to 3.69 times slower compared to CUDD and Sylvan, exclusively using main memory. Yet, our proposed techniques are able to obtain this performance at a fraction of the main memory needed by conventional BDD packages to function. Furthermore, with Adiar we are able to manipulate BDDs that outgrow main memory and so surpass the limits of other BDD packages.


Introduction
A Binary Decision Diagram (BDD) provides a canonical and concise representation of a boolean function as an acyclic rooted graph.This turns manipulation of boolean functions into manipulation of directed acyclic graphs [10,11].
Their ability to compress the representation of a boolean function has made them widely used within the field of verification.BDDs have especially found use in model checking, since they can efficiently represent both the set of states and the state-transition function [11].Examples are the symbolic model checkers NuSMV [17,18], MCK [22], LTSmin [25], and MCMAS [31] and the recently envisioned symbolic model checking algorithms for CTL* in [3] and for CTLK in [24].Bryant and Marijn [13][14][15] also recently devised how to use BDDs to construct extended resolution proofs to verify the result of SAT and QBFsolvers.Hence, continuous research effort is devoted to improve the performance of this data structure.For example, despite the fact that BDDs were initially envisioned back in 1986, BDD manipulation was first parallelised in 2014 by Velev and Gao [44] for the GPU and in 2016 by Dijk and Van de Pol [19] for multi-core processors [12].
The most widely used implementations of decision diagrams make use of recursive depth-first algorithms and a unique node table [9,19,26,30,43].Lookup of nodes in this table and following pointers in the data structure during recursion both pause the entire computation while missing data is fetched [27,34].For large enough instances, data has to reside on disk and the resulting I/Ooperations that ensue become the bottle-neck.So in practice, the limit of the computer's main memory becomes the limit on the size of the BDDs.

Related Work
Prior work has been done to overcome the I/Os spent while computing on BDDs.Ben-David et al. [8] and Grumberg, Heyman, and Schuster [23] have made distributed symbolic model checking algorithms that split the set of states between multiple computation nodes.This makes the BDD on each machine of a manageable size, yet it only moves the problem from upgrading main memory of a single machine to expanding the number of machines.David Long [32] achieved a performance increase of a factor of two by blocking all nodes in the unique node table based on their time of creation, i.e. with a depth-first blocking.But, in [6] this was shown to only improve the worst-case behaviour by a constant.Minato and Ishihara [34] got BDD manipulation to work on disk by serializing the depth-first traversal of the BDDs, where hash tables in main memory were used to identify prior visited nodes in the input and output streams.With limited main memory these tables could not identify all prior constructed nodes and so the serialization may include the same subgraphs multiple times.This breaks canonicity of the BDDs and may also in the worstcase result in an exponential increase of the constructed BDDs.
Ochi, Yasuoka, and Yajima [37] made in 1993 the BDD manipulation algorithms breadth-first to thereby exploit a levelwise locality on disk.Their technique has been heavily improved by Ashar and Cheong [7] in 1994 and further improved by Sanghavi et al. [40] in 1996 to produce the BDD library CAL capable of manipulating BDDs larger than the main memory.Kunkle, Slavici and Cooperman [29] extended in 2010 the breadth-first approach to distributed BDD manipulation.
The breadth-first algorithms in [7,37,40] are not optimal in the I/O-model, since they still use a single hash table for each level.This works well in practice, as long as a single level of the BDD can fit into main memory.If not, they still exhibit the same worst-case I/O behaviour as other algorithms [6].
In 1995, Arge [5,6] proposed optimal I/O algorithms for the basic BDD operations Apply and Reduce.To this end, he dropped all use of hash tables.Instead, he exploited a total and topological ordering of all nodes within the graph.This is used to store all recursion requests in priority queues, s.t.they are synchronized with the iteration through the sorted input stream of nodes.Martin Šmérek attempted to implement these algorithms in 2009 exactly as they were described in [5,6].But, the performance of the resulting external memory symbolic model checking algorithms was disapointing, since the size of the unreduced BDD and the number of isomorphic nodes to merge grew too large and unwieldy in practice [personal communication, Sep 2021].

Contributions
Our work directly follows up on the theoretical contributions of Arge in [5,6].We simplify his I/O-optimal Apply and Reduce algorithms.In particular, we modify the intermediate representation, to prevent data duplication and to save on the number of sorting operations.Furthermore, we are able to prune the output of the Apply and decrease the number of elements needed to be placed in the priority queue of Reduce, which in practice improves space use and running time performance.We also provide I/O-efficient versions of several other standard BDD operations, where we obtain asymptotic improvements for the operations that are derivable from Apply.Finally, we reduce the ordering of nodes to a mere ordering of integers and we propose a priority queue specially designed for these BDD algorithms, to obtain a considerable constant factor improvement in performance.
Our proposed algorithms and data structures have been implemented to create a new easy-to-use and open-source BDD package, named Adiar.Our experimental evaluation shows that these techniques enable the manipulation of BDDs larger than the given main memory, with only an acceptable slowdown compared to a conventional BDD package running exclusively in main memory.

Overview
The rest of the paper is organised as follows.Section 2 covers preliminaries on the I/O-model and Binary Decision Diagrams.We present our algorithms for I/O-efficient BDD manipulation in Section 3 together with ways to improve their performance.Section 4 provides an overview of the resulting BDD package, Adiar, and Section 5 contains the experimental evaluation of our proposed algorithms.Finally, we present our conclusions and future work in Section 6.

The I/O-Model
The I/O-model [1] allows one to reason about the number of data transfers between two levels of the memory hierarchy, while abstracting away from technical details of the hardware, to make a theoretical analysis manageable.
An I/O-algorithm takes inputs of size N , residing on the higher level of the two, i.e. in external storage (e.g on a disk).The algorithm can only do computations on data that reside on the lower level, i.e. in internal storage (e.g.main memory).This internal storage can only hold a smaller and finite number of M elements.Data is transferred between these two levels in blocks of B consecutive elements [1].Here, B is a constant size not only encapsulating the page size or the size of a cache-line but more generally how expensive it is to transfer information between the two levels.The cost of an algorithm is the number of data transfers, i.e. the number of I/O-operations or just I/Os, it uses.
For all realistic values of N , M , and B the following inequality holds.
where sort(N ) N/B • log M/B (N/B) [1] is the sorting lower bound, i.e. it takes Ω(sort(N )) I/Os in the worst-case to sort a list of N elements [1].With an M/B-way merge sort algorithm, one can obtain an optimal O(sort(N )) I/O sorting algorithm [1], and with the addition of buffers to lazily update a tree structure, one can obtain an I/O-efficient priority queue capable of inserting and extracting N elements in O(sort(N )) I/Os [4].

Cache-Oblivious Algorithms
An algorithm is cache-oblivious if it is I/O-efficient without explicitly making any use of the variables M or B; i.e. it is I/O-efficient regardless of the specific machine in question and across all levels of the memory hierarchy [20].We furthermore assume the relationship between M and B satisfies the tall cache assumption [20], which is that With a variation of the merge sort algorithm one can make it cache-oblivious [20].Furthermore, Sanders [39] demonstrated how to design a cache-oblivious priority queue.

TPIE
The TPIE software library [45] provides an implementation of I/O-efficient algorithms and data structures such that the management of the B-sized buffers is completely transparent to the programmer.
Elements can be stored in files that act like lists and are accessed with iterators.Each iterator can write new elements at the end of a file or on top of prior written elements in the file.For our purposes, we only need to push elements to a file, i.e. write them at the end of a file.The iterator can traverse the file in both directions by reading the next element, provided has next returns true.One can also peek the next element without moving the read head.
TPIE provides an optimal O(sort(N )) external memory merge sort algorithm for its files.It also provides a merge sort algorithm for non-persistent data that saves an initial 2 • N/B I/Os by sorting the B-sized base cases before flushing them to disk the first time.Furthermore, it provides an implementation of the I/O-efficient priority queue of [39] as developed in [38], which supports the push, top and pop operations.

Binary Decision Diagrams
A Binary Decision Diagram (BDD) [10], as depicted in Fig. 1, is a rooted directed acyclic graph (DAG) that concisely represents a boolean function B n → B, where B = {⊤, ⊥}.The leaves contain the boolean values ⊥ and ⊤ that define the output of the function.Each internal node contains the label i of the input variable x i it represents, together with two outgoing arcs: a low arc for when x i = ⊥ and a high arc for when x i = ⊤.We only consider Ordered Binary Decision Diagrams (OBDD), where each unique label may only occur once and the labels must occur in sorted order on all paths.The set of all nodes with label j is said to belong to the jth level in the DAG.
If one exhaustively (1) skips all nodes with identical children and (2) removes any duplicate nodes then one obtains the Reduced Ordered Binary Decision Diagram (ROBDD) of the given OBDD.If the variable order is fixed, this reduced OBDD is a unique canonical form of the function it represents.[10] The two primary algorithms for BDD manipulation are called Apply and Reduce.The Apply computes the OBDD h = f ⊙ g where f and g are OBDDs and ⊙ is a function B × B → B. This is essentially done by recursively computing the product construction of the two BDDs f and g and applying ⊙ when recursing to pairs of leaves.The Reduce applies the two reduction rules on an OBDD bottom-up to obtain the corresponding ROBDD. [10]

I/O-Complexity of Binary Decision Diagrams
Common implementations of BDDs use recursive depth-first procedures that traverse the BDD and the unique nodes are managed through a hash table [9,19,26,30,43].The latter allows one to directly incorporate the Reduce algorithm of [10] within each node lookup [9,35].They also use a memoisation table to minimise the number of duplicate computations [19,30,43].If the size N f and N g of two BDDs are considerably larger than the memory M available, each recursion request of the Apply algorithm will in the worst case result in an I/O-operation when looking up a node within the memoisation table and when following the low and high arcs [6,27].Since there are up to N f • N g recursion requests, this results in up to O(N f • N g ) I/Os in the worst case.The Reduce operation transparently built into the unique node table with a find-or-insert function can also cause an I/O for each lookup within this table [27].This adds yet another O(N ) I/Os, where N is the number of nodes in the unreduced BDD.For example, the BDD package BuDDy [30] uses a linked-list implementation for its unique node table's buckets.The index to the first node of the ith bucket is stored together with the ith node in the table.Figure 2 shows the performance of BuDDy solving the Tic-Tac-Toe benchmark for N = 21 when given variable amounts of memory (see Section 5 for a description of this benchmark).This experiment was done on a machine with 8 GiB of memory and 8 GiB of swap.Since a total of 3075 MiB of nodes are processed and all nodes are placed consecutively in memory, then all BDD nodes easily fit into the machine's main memory.That means, the slowdown from 37 seconds to 49 minutes in Fig. 2 is purely due to random access into swap memory caused by the find-or-insert function of the implicit Reduce.
Lars Arge provided a description of an Apply algorithm that is capable of using only O(sort(N f • N g )) I/Os and a Reduce algorithm that uses O(sort(N )) I/Os [5,6].He also proves these to be optimal for the level ordering of nodes on

RAM Swap
Unique node-table size (GiB) Time (seconds) Figure 2: Running time of BuDDy [30] solving Tic-Tac-Toe for N = 21 on a laptop with 8 GiB memory and 8 GiB swap (lower is better).disk used by both algorithms [6].These algorithms do not rely on the value of M or B, so they can be made cache-aware or cache-oblivious when using underlying sorting algorithms and data structures with those characteristics.We will not elaborate further on his original proposal, since our algorithms are simpler and better convey the algorithmic technique used.Instead, we will mention where our Reduce and Apply algorithms differ from his.

BDD Operations by Time-forward Processing
Our algorithms exploit the total and topological ordering of the internal nodes in the BDD depicted in (1) below, where parents precede their children.It is topological by ordering a node by its label, i : N, and total by secondly ordering on a node's identifier, id : N.This identifier only needs to be unique on each level as nodes are still uniquely identifiable by the combination of their label and identifier.
We write the unique identifier (i, id ) : N × N for a node as x i,id .BDD nodes do not contain an explicit pointer to their children but instead the children's unique identifier.Following the same notion, leaf values are stored directly in the leaf's parents.This makes a node a triple (uid , low , high) where uid : N×N is its unique identifier and low and high : (N×N)+B are its children.The ordering in (1) is lifted to compare the uid s of two nodes, and so a BDD is represented by a file with BDD nodes in sorted order.For example, the BDDs in Fig. 1 would be represented as the lists depicted in Fig. 3.This ordering of nodes can be exploited with the time-forward processing technique, where recursive calls are not executed at the time of issuing the request but instead when the element in question is encountered later in the iteration through the given input file.This is done with one or more priority queues that follow the same ordering as the input and deferring recursion by pushing the request into the priority queues.The Apply algorithm in [6] produces an unreduced OBDD, which is turned into an ROBDD with Reduce.The original algorithms of Arge solely work on a node-based representation.Arge briefly notes that with an arc-based representation, the Apply algorithm is able to output its arcs in the order needed by the following Reduce, and vice versa.Here, an arc is a triple (source, is high, target ) (written as source is high − −− → target) where source : N × N, is high : B, and target : (N × N) + B, i.e. the source contains the unique identifier of internal nodes while the target either contains a unique identifier or the value of a leaf.We have further pursued this idea of an arc-based representation and can conclude that the algorithms indeed become simpler and more efficient with an arc-based output from Apply.On the other hand, we see no such benefit over the more compact node-based representation in the case of Reduce.Hence as is depicted in Fig. 4, our algorithms work in tandem by cycling between the node-based and arc-based representation.

Apply
Reduce Notice that our Apply outputs two files containing arcs: arcs to internal nodes (blue) and arcs to leaves (red).Internal arcs are output at the time of their target are processed, and since nodes are processed in ascending order, internal arcs end up being sorted with respect to the unique identifier of their target.This groups all in-going arcs to the same node together and effectively reverses internal arcs.Arcs to leaves, on the other hand, are output at the time of their source is processed, which groups all out-going arcs to leaves together.These two outputs of Apply represent a semi-transposed graph, which is exactly of the form needed by the following Reduce.For example, the Apply on the node-based ROBDDs in Fig. 1a and 1b with logical implication as the operator will yield the arc-based unreduced OBDD depicted in Fig. 5.
For simplicity, we will ignore any cases of leaf-only BDDs in our presentation of the algorithms.They are easily extended to also deal with those cases.

Apply
Our Apply algorithm works by a single top-down sweep through the input DAGs.Internal arcs are reversed due to this top-down nature, since an arc between two internal nodes can first be resolved and output at the time of the arc's target.These arcs are placed in the file F internal .Arcs from nodes to leaves are placed in the file F leaf .
The algorithm itself essentially works like the standard Apply algorithm.Given a recursion request for the pair of input nodes v f from f and v g from g, a single node v is created with label min(v f .uid.label, v g .uid.label ) and recursion requests r low and r high are creatd for its two children.If the label of v f .uidand v g .uidare equal, then r low = (v f .low, v g .low ) and r high = (v f .high,v g .high).Otherwise, r low , resp.r high , contains the uid of the low child, resp.the high child, of min(v f , v g ), whereas max(v f .uid, v g .uid ) is kept as is.
The RequestsFor function computes the requests r low and r high described above as shown in Fig. 6.It resolves the request for the tuple (t f , t g ) where at x 0,0 (x2,0, x0,0) x 1,0 (x2,0, x1,0) x 2,0 (x2,0, ⊥) internal arcs leaf arcs  least one of them identify an internal node.The arguments v f and v g are the nodes currently read, where we can only guarantee t f = v f .uidor t g = v g .uidbut not necessarily that both match.If the label of t f and t g are the same but only one of them matches, then low and high contains the children of the other.The surrounding pieces of the Apply algorithm shown in Fig. 7 are designed such that all the information needed by RequestsFor is made available in an I/O-efficient way.Input nodes v f and v g are read on lines 12 -14 from f and g as a merge of the two sorted lists based on the ordering in (1).Specifically, these lines maintain that t seek ≤ v f and t seek ≤ v g , where the t seek variable itself is a monotonically increasing unique identifier that changes based on the given recursion requests for a pair of nodes (t f , t g ).These requests are synchronised with this traversal of the nodes by use of the two priority queues Q app:1 and Q app:2 .Q app:1 has elements of the form (s The use for the boolean is high and the unique identifiers s, low , and high will become apparent below. The priority queue Q app:1 is aligned with the node min(t f , t g ), i.e. the one of t f and t g that is encountered first.That is, its elements are sorted in ascending order based on min(t f , t g ) of each request.Requests to the same (t f , t g ) are grouped together by secondarily sorting the tuples lexicographically.The algorithm maintains the following invariant between the current nodes v f and v g and the requests within The second priority queue Q app:2 is used in the case of t f .label= t g .labeland RequestsFor ( ( t f , tg ) , v f , vg , low , high , ⊙ ) r low , r high // Combine a c c u m u l a t e d i n f o r m a t i o n f o r r e q u e s t s i f tg ∈ {⊥, ⊤} ∨ ( t f ∈ {⊥, ⊤} ∧ t f .label < tg .label ) then r low ← ( v f .low , tg ) ; r high ← ( v f .high , tg ) e l s e i f t f ∈ {⊥, ⊤} ∨ ( t f .label > tg .label  f .id= t g .id, i.e. when RequestsFor needs information from both nodes to resolve the request but they are not guaranteed to be visited simultaneously.To this end, it is sorted by max(t f , t g ) in ascending order, i.e. the second of the two to be visited, and ties are again broken lexicographically.The invariant for this priority queue is comparatively more intricate.

∀(s
In the case that a request is made for a tuple (t f , t g ) with the same label but different identifiers, then they are not necessarily available at the same time.
Rather, the node with the unique identifier min(t f , t g ) is visited first and the one with max(t f , t g ) some time later.Hence, requests s Here, the request is extended with the low and high of the node min(v f , v g ) such that the children of min(v f , v g ) are available at max(v f , v g ), despite the fact that min(v f , v g ).uid < max(v f , v g ).uid .
The requests from Q app:1 and Q app:2 are merged on lines 10 with the TopOf   − → (t f :1 , t g:1 )) be the top element of Q app:1 and let the top element of Q app:2 be r 2 = (s < max(t f :2 , t g:2 ) and r 2 otherwise.If either one is empty, then it equivalently outputs the top request of the other.
When a request is resolved, then the newly created recursion requests r low and r high to its children are placed at the end of F leaf or pushed into Q app:1 on lines 28 -31, depending on whether it is a request to a leaf or not.All ingoing arcs to the resolved request are output on lines 33 -41.To output these ingoing arcs the algorithm uses the last two pieces of information that was forwarded: the unique identifier s for the source of the request and the boolean is high for whether the request was following a high arc.
The arc-based output greatly simplifies the algorithm compared to the original proposal of Arge in [6].Our algorithm only uses two priority queues rather than four.Arge's algorithm, like ours, resolves a node before its children, but due to the node-based output it has to output this entire node before its children.Hence, it has to identify its children by the tuple (t f , t g ) which doubles the amount of space used and it also forces one to relabel the graph afterwards costing yet another O(sort(N )) I/Os.Instead, the arc-based output allows us to output the information at the time of the children and hence we are able to generate the label and its new identifier for both parent and child.Arge's algorithm neither forwarded the source s of a request, so repeated requests to the same pair of nodes were merely discarded upon retrieval from the priority queue, since they carried no relevant information.Our arc-based output, on the other hand, makes every element placed in the priority queue forward a source s, vital for the creation of the transposed graph.Proposition 3.1 (Following Arge 1996 [6]).The Apply algorithm in Fig. 7 has where N f and N g are the respective sizes of the BDDs for f and g.
Proof.It only takes O((N f + N g )/B) I/Os to read the elements of f and g once in order.There are at most 2 • N f • N g many arcs being outputted into F internal or F leaf , resulting in at most O((N f • N g )/B) I/Os spent on writing the output.Each element creates up to two requests for recursions placed in Q app:1 , each of which may be reinserted in Q app:2 to forward data across the level, which totals The worst-case time complexity is derived similarly, since next to the priority queues and reading the input only a constant amount of work is done per request.

Pruning by Shortcutting the Operator
The Apply procedure as presented above, like Arge's original algorithm in [4,6], follows recursion requests until a pair of leaves are met.Yet, for example in Fig. 5 the node for the request (x 2,0 , ⊤) is unnecessary to resolve, since all leaves of this subgraph trivially will be ⊤ due to the implication operator.The later Reduce will remove any nodes with identical children bottom-up and so this node will be removed in favour of the ⊤ leaf.
This observation implies we can resolve the requests earlier for any operator that can shortcut the resulting leaf.To this end, one only needs to extend the RequestsFor to be the ShortcuttedRequestsFor function in Fig. 8 which immediately outputs a request to the relevant leaf value.This decreases the number of requests placed in Q app:1 and Q app:2 .Furthermore, it decreases the size of the final unreduced BDD, which again in turn will speed up the following Reduce.

Reduce
Our Reduce algorithm in Fig. 9 works like other explicit variants with a single bottom-up sweep through the unreduced OBDD.Since the nodes are resolved and output in a bottom-up descending order then the output is exactly in the reverse order as it is needed for any following Apply.We have so far ignored this detail, but the only change necessary to the Apply algorithm in Section 3.1 is for it to read the list of nodes of f and g in reverse.
The priority queue Q red is used to forward the reduction result of a node v to its parents in an I/O-efficient way.Q red contains arcs from unresolved sources s in the given unreduced OBDD to already resolved targets t ′ in the ROBDD under construction.The bottom-up traversal corresponds to resolving all nodes first sorted on s and secondly on is high; the latter simplifies retrieving low and high arcs on line 8 and 9, since the high arc always retrieved first.The base-cases for the Reduce algorithm are the arcs to leaves in F leaf , which follow the exact same ordering.Hence, on lines 8 and 9, arcs in Q red and F leaf are merged using the PopMax function that retrieves the arc that is maximal with respect to this ordering.
Since nodes are resolved in descending order, F internal follows this ordering on the arc's target when elements are read in reverse.The reversal of arcs in F internal makes the parents of a node v, to which the reduction result is to be forwarded, readily available on lines 26 -32.
The algorithm otherwise proceeds similarly to other Reduce algorithms.For each level j, all nodes v of that level are created from their high and low arcs, e high and e low , taken out of Q red and F leaf .The nodes are split into the two temporary files F j:1 and F j:2 that contain the mapping [uid → uid ′ ] from the unique identifier uid of a node in the given unreduced BDD to the unique identifier uid ′ of the equivalent node in the output.F j:1 contains the mapping from a node v removed due to the first reduction rule and is populated on lines 7 -13: if both children of v are the same then the mapping [v.uid → v.low ] is pushed to F j:1 .That is, the node v is not output and is made equivalent to its single child.F j:2 contains the mappings for the second rule and is resolved on lines 15 -24.Here, the remaining nodes are placed in an intermediate file F j and sorted by their children.This makes duplicate nodes immediate successors in F j .Every new unique node encountered in F j is written to the output F out before mapping itself and all its duplicates to it in F j:2 .Since nodes are output out-of-order compared to the input and without knowing how many will be output for said level, they are given new decreasing identifiers starting from the maximal possible value MAX ID.Finally, F j:2 is sorted back in order of F internal to forward the results in both F j:1 and F j:2 to their parents on lines 26 -32.
Here, MergeMaxUid merges the mappings [uid → uid ′ ] in F j:1 and F j:2 by always taking the mapping with the largest uid from either file.
Since the original algorithm of Arge in [6] takes a node-based OBDD as an input and only internally uses node-based auxiliary data structures, his Reduce algorithm had to create two copies of the input to create the transposed subgraph: one where the copies were sorted by the nodes' low child and one where they are sorted by their high child.The reversal of arcs in F internal merges these auxiliary data structures of Arge into a single set of arcs easily generated by the preceding Apply.This not only simplifies the algorithm but also more than halves the memory used and it eliminates two expensive sorting steps.We also apply the first reduction rule before the sorting step, thereby decreasing the number of nodes involved in the remaining expensive computation of that level.
Another consequence of his node-based representation is that his algorithm had to move all arcs to leaves into Q red rather than merging requests from Q red with the base-cases from F leaf .Let N ℓ ≤ 2N be the number of arcs to leaves then all internal arcs is 2N − N ℓ , where N is the total number of nodes.Our semi-transposed input considerably decreases the number I/Os used and time spent, which makes it worthwhile in practice.
It now suffices to focus on the latter half concerning N ℓ .The constant c involved in the Θ(sort(N ℓ )) of the priority queue must be so large that the following inequality holds for N ℓ greater than some given n 0 .
For the Ω(sort(N ℓ /B)) lower bound, we need to find a c ′ for N ℓ large enough to satisfy the following inequality.
The above is satisfied with Depending on the given BDD, this decrease can result in an improvement in performance that is notable in practice as our experiments in Section 5.3.3show.Furthermore, the fraction N ℓ /(2N ) only increases when recursion requests are pruned; our experiments in Section 5.3.4show that pruning increases this fraction by a considerable amount.Arge proved in [5] that this O(sort(N )) I/O complexity is optimal for the input, assuming a levelwise ordering of nodes (See [6] for the full proof).While the O(N log N ) time is theoretically slower than the O(N ) depth-first approach using a unique node table and an O(1) time hash function, one should note the log factor depends on the sorting algorithm and priority queue used where I/O-efficient instances have a large M/B log factor.

Other Algorithms
Arge only considered Apply and Reduce, since most other BDD operations can be derived from these two operations together with the Restrict operation, which fixes the value of a single variable.We have extended the technique to create a time-forward processed Restrict that operates in O(sort(N )) I/Os, which is described below.Furthermore, by extending the technique to other operations, we obtain more efficient BDD algorithms than by computing it using Apply and Restrict alone.

Node and Variable Count
The Reduce algorithm in Section 3.2 can easily be extended to also provide the number of nodes N generated and the number of levels L. Using this, it only takes O(1) I/Os to obtain the node count N and the variable count L.
Lemma 3.4.The number of nodes N and number of variables L occuring in the BDD for f is obtainable in O(1) I/Os and time.

Negation
A BDD is negated by inverting the value in its nodes' leaf children.This is an O(1) I/O-operation if a negation flag is used to mark whether the nodes should be negated on-the-fly as they are read from the stream.This is a major improvement over the O(sort(N )) I/Os spent by Apply to compute f ⊕ ⊤, where ⊕ is exclusive disjunction; especially since space can be shared between the BDD for f and ¬f .

Equality Checking
To check for f ≡ g one has to check the DAG of f being isomorphic to the one for g [10].This makes f and g trivially inequivalent when at least one of the following three statements are violated.
where N f and N g is the number of nodes in the BDD for f and g, L f and L g the number of levels, L f,i and L g,i the respective labels of the ith level, and N f,i and N g,i the number of nodes on the ith level.This can respectively be checked in O(1), O(1), and O(L/B) number of I/Os with meta-information easily generated by the Reduce algorithm in Section 3.

2.
In what follows, we will address equality checking the non-trivial cases.Assuming the constraints in (2) we will omit f and the g in the subscript and just write N , L, L i , and N i .
An O(sort(N )) equality check.If f ≡ g, then the isomorphism relates the roots of the BDDs for f and g.Furthermore, for any node v f of f and v g of g, if (v f , v g ) is uniquely related by the isomorphism then so should (v f .low, v g .low ) and (v f .high,v g .high).Hence, the Apply algorithm can be adapted to not output any arcs.Instead, it returns ⊥ and terminates early when one of the following two conditions are violated.
• The children of the given recursion request (t f , t g ) should both be a leaf or an internal node.Furthermore, pairs of internal nodes should agree on the label while pairs of leaves should agree on the value.
• On level i, exactly N i unique recursion requests should be retrieved from the priority queues.If more than N i are given, then prior recursions have related one node of f , resp.g, to two different nodes in g, resp.f .This implies that G f and G g cannot be isomorphic.
If no recursion requests fail, then the first condition of the two guarantees that f ≡ g and so ⊤ is returned.The second condition makes the algorithm terminate earlier on negative cases and lowers the provable complexity bound.Proof.Due to the second termination case, no more than N i unique requests are resolved for each level i in the BDDs for f and g.This totals at most N requests are processed by the above procedure regardless of whether f ≡ g or not.This bounds the number of elements placed in the priority queues to be 2N .The two complexity bounds then directly follow by similar analysis as given in the proof of Proposition 3.1 This is an asymptotic improvement on the O(sort(N 2 )) equality checking algorithm by computing f ↔ g with the Apply procedure and then checking whether the reduced BDD is the ⊤ leaf.Furthermore, it is also a major improvement in the practical efficiency, since it does not need to run the Reduce algorithm, s and is high in the s is high − −− → (t f , t g ) recursion requests are irrelevant and memory is saved by omitting them, it has no O(N 2 ) intermediate output, and it only needs a single request for each (t f , t g ) to be moved into the second priority queue (since s and is high are omitted).In practice, it performs even better since it terminates on the first violation.
A 2 • N/B equality check.One can achieve an even faster equality checking procedure, by strengthening the constraints on the node ordering.On top of the ordering in (1) on page 7, we require leaves to come in-order after internal nodes: and we add the following constraint onto internal nodes (x i,id 1 , low 1 , high 1 ) and (x i,id 2 , low 2 , high 2 ) on level i: This makes the ordering in (1) into a lexicographical three-dimensional ordering of nodes based on their label and their children.The key idea here is, since the second reduction rule of Bryant [10] removes any duplicate nodes then the low and high children directly impose a total order on their parents.
We will say that the identifiers are compact if all nodes on level i have identifiers within the interval [0, N i ).That is, the jth node on the ith level has the identifier j − 1.
Proposition 3.7.Let G f and G g be two ROBDDs with compact identifiers that respect the ordering of nodes in (1) extended with (3,4).Then, f ≡ g if and only if for all levels i the jth node on level i in G f and G g match numerically.
Proof.Suppose that all nodes in G f and G g match.This means they describe the same DAG, are hence trivially isomorphic, and so f ≡ g.Now suppose that f ≡ g, which means that G f and G g are isomorphic.We will prove the stronger statement that for all levels i the jth node on the ith level in G f and G g not only match numerically but are also the root to the very same subgraphs.We will prove this by strong induction on levels i, starting from the deepest level ℓ max up to the root with label ℓ min .
For i = ℓ max , since an ROBDD does not contain duplicate nodes, there exists only one or two nodes in G f and G g on this level: one for x ℓmax and/or one describing ¬x ℓmax .Since ⊥ < ⊤, the extended ordering in (4) gives us that the node describing x ℓmax comes before ¬x ℓmax .The identifiers must match due to compactness.
For ℓ min ≤ i < ℓ max , assume per induction that for all levels i ′ > i the j ′ th node on level i ′ describe the same subgraphs.Let v f,1 , v f,2 , . . ., v f,Ni and v g,1 , v g,2 , . . ., v g,Ni be the N i nodes in order on the ith level of G f and G g .For every j, the isomorphism between G f and G g provides a unique j ′ s.t.v f,j and v g,j ′ are the respective equivalent nodes.From the induction hypotheses we have that v f,j .low= v g,j ′ .lowand v f,j .high= v g,j ′ .high.We are left to show that their identifiers match and that j = j ′ , which due to compactness are equivalent statements.The non-existence of duplicate nodes guarantees that all other nodes on level i in both G f and G g differ in either their low and/or high child.The total ordering in ( 4) is based on these children alone and so from the induction hypothesis the number of predecessors, resp.successors, to v f,j must be the same as for v g,j ′ .That is, j and j ′ are the same.
The Reduce algorithm in Figure 9 already outputs the nodes on each level in the order that satisfies the constraint (3) and (4).It also outputs the jth node of each level with the identifier MAX ID−N i +j, i.e. with compact identifiers shifted by MAX ID − N i .Hence, Proposition 3.7 applies to the ROBDDs constructed by this Reduce and they can be compared in 2 • N/B I/Os with a single iteration through all nodes.This iteration fails at the same pair of nodes, or even earlier, than the prior O(sort(N )) algorithm.Furthermore, this linear scan of both BDDs is optimal -even with respect to the constant -since any deterministic comparison procedure has to look at every element of both inputs at least once.
As is apparent in the proof of Proposition 3.7, the ordering of internal nodes relies on the ordering of leaf values.Hence, the negation algorithm in Section 3.3.2breaks this property when the negation flags on G f and G g do not match.
Corollary 3.8.If G f and G g are outputs of Fig. 9 and their negation flag are equal, then equality checking of f ≡ g can be done in 2 • N/B I/Os and O(N ) time, where N is the size of G f and G g .
One could use the Reduce algorithm to make the negated BDD satisfy the preconditions of Proposition 3.7.However, the O(sort(N )) equality check will be faster in practice, due to its efficiency in both time, I/O, and space.

Path and Satisfiability Count
The number of paths that lead from the root of a BDD to the ⊤ leaf can be computed with a single priority queue that forwards the number of paths from the root to a node t through one of its ingoing arcs.At t these ingoing paths are accumulated, and if t has a ⊤ leaf as a child then the number of paths to t added to a total sum of paths.The number of ingoing paths to t is then forwarded to its non-leaf children.
This easily extends to counting the number of satisfiable assignments by knowing the number of variables of the domain and extending the request in the priority queue with the number of levels visited.For all N nodes there spent is O(1) time on every in-going arc.There are a total of 2N arcs, so O(N ) amount of work is done next to reading the input and using the priority queue.Hence, the O(N log N ) time spent on the priority queue dominates the asymptotic running time.Both of these operations not possible to compute with the Apply operation.

Evaluating and Obtaining Assignments
Evaluating the BDD for f given some x boils down to following a path from the root down to a leaf based on x.The levelized ordering makes it possible to follow a path starting from the root in a single scan of all nodes, where irrelevant nodes are ignored.Similarly, finding the lexicographical smallest or largest x that makes f ( x) = ⊤ also corresponds to providing the trace of one specific path in the BDD [28].If the number of Levels L is smaller than N/B then this uses more I/Os compared to the conventional algorithms that uses random access; when jumping from node to node at most one node on each of the L levels is visited, and so at most one block for each level is retrieved.

If-Then-Else
The If-Then-Else procedure is an extension of the idea of the Apply procedure where requests are triples (t f , t g , t h ) rather than tuples and three priority queues are used rather than two.The idea of pruning in Section 3.1.1can also be applied here: if t f ∈ {⊥, ⊤} then either t g or t h is irrelevant and it can be replaced with Nil such that these requests are merged into one node, and if t g = t h ∈ {⊥, ⊤} then t f does not need to be evaluated and the leaf can immediately be output.Proposition 3.12.The If-Then-Else procedure has , where N f , N g , and N h are the respective sizes of the BDDs for f , g, and h.
Proof.The proof follows by similar argumentation as for Proposition 3.1.

Restrict
Given a BDD f , a label i, and a boolean value b the function f | xi=b can be computed in a single top-down sweep.The priority queue contains arcs from a source s to a target t.When encountering the target node t in f , if t does not have label i then the node is kept and the arc s This algorithm has one problem with the pipeline in Fig. 4. Since nodes are skipped, and the source s is forwarded, then the leaf arcs may end up being produced and output out of order.In those cases, the output arcs in F leaf need to be sorted before the following Reduce can begin.This trivially extends to multiple variables in a single sweep by being given the assignments x in ascending order relative to the labels i.

Quantification
Given Q ∈ {∀, ∃}, a BDD f , and a label i the BDD representing Qb ∈ B : f | xi=b is computable using O(sort(N 2 )) I/Os by combining the ideas for the Apply and the Restrict algorithms.
Requests are either a single node t in f or past level i a tuple of nodes (t 1 , t 2 ) in f .A priority queue initially contains requests from the root of f to its children.Two priority queues are used to synchronise the tuple requests with the traversal of nodes in f .If the label of a request t is i, then, similar to Restrict, its source s is linked with a single request to (t.low , t.high).Tuples are handled as in Apply where Both conjunction and disjunction are commutative operations, so the order of t 1 and t 2 is not relevant.Hence, the tuple (t 1 , t 2 ) can be kept sorted such that t 1 < t 2 .This improves the performance of the comparator used within the priority queues since min(t 1 , t 2 ) = t 1 and max(t The idea of pruning in Section 3.1.1also applies here.But, in the following cases can a request to a tuple be turned back into a request to a single node t. • Any request (t 1 , t 2 ) where t 1 = t 2 is equivalent to the request to t 1 .
• As we only consider operators ∨ and ∧ then any leaf v ∈ {⊥, ⊤} that does not shortcut ⊙ is irrelevant for the leaf values of the entire subtree.
Hence, any such request (t, v) can safely be turned into a request to t.
This collapses three potential subtrees into one, which decreases the size of the output and the number of elements placed in the priority queues.This could also be computed with Apply and Restrict as f | xi=⊥ ⊙ f | xi=⊤ , which would also only take an O(sort(N ) + sort(N ) + sort(N 2 )) number of I/Os, but this requires three separate runs of Reduce and makes the intermediate inputs to Apply of up to 2N in size.

Composition
The composition f | xi=g of two BDDs f and g can be computed by reusing the ideas and optimisations from the Quantification and the If-Then-Else procedures.Requests start out as tuples (t g , t f ) and at level i they are turned into triples (t g , t f1 , t f2 ) which are handled similar to the If-Then-Else.The idea of merging recursion requests based on the value of t f1 and t f2 as for Quantification still partially applies here: if t f1 = t f2 or t g is a leaf, then the entire triple can boiled down to t f1 or t f2 .Proposition 3. 16.
where N f and N g are the respective sizes of the BDDs for f and g.
Proof.There are N2 f •N g possible triples, each of which can be in the output and in the priority queue.The two complexity bounds follows by similar analysis as in the proof of Proposition 3.12.

Alternatively, this is computable with the
)) I/Os if the Apply is used.But as argued above, computing the If-Then-Else with Apply is not optimal.Furthermore, similar to quantification, the constants involved in using Restrict for intermediate outputs is also costly.

Levelized Priority Queue
In practice, sorting a set of elements with a sorting algorithm is considerably faster than with a priority queue [36].Hence, the more one can use mere sorting instead of a priority queue the faster the algorithms will run.
Since all our algorithms resolve the requests level by level then any request pushed to the priority queues Q app:1 and Q red are for nodes on a yet-not-visited level.That means, any requests pushed when resolving level i do not change the order of any elements still in the priority queue with label i.Hence, for L being the number of levels and L < M/B one can have a levelized priority queue 2 by maintaining L buckets and keep one block for each bucket in memory.If a block is full then it is sorted, flushed from memory and replaced with a new empty block.All blocks for level i are merge-sorted into the final order of requests when the algorithm reaches the ith level.
While L < M/B is a reasonable assumption between main memory and disk it is not at the cache-level.Yet, most arcs only span very few levels, so it suffices for the levelized priority queue to only have a small preset k number of buckets and then use an overflow priority queue to sort the few requests that go more than k levels ahead.A single block of each bucket fits into the cache for very small choices of k, so a cache-oblivious levelized priority queue would not lose its characteristics.Simultaneously, smaller k allows one to dedicate more internal memory to each individual bucket, which allows a cache-aware levelized priority queue to further improve its performance.

Memory Layout and Efficient Sorting
A unique identifier can be represented as a single 64-bit integer as shown in Fig. 10.Leaves are represented with a 1-flag on the most significant bit, its value v on the next 62 bits and a boolean flag f on the least significant bit.A node is identified with a 0-flag on the most significant bit, the next ℓ-bits dedicated to its label followed by 62 − ℓ bits with its identifier, and finally the boolean flag f available on the least-significant bit.A node is represented with 3 64-bit integers: two of them for its children and the third for its label and identifier.An arc is represented by 2 64-bit numbers: the source and target each occupying 64-bits and the is high flag stored in the f flag of the source.This reduces all the prior orderings to a mere trivial ordering of 64-bit integers.A descending order is used for a bottom-up traversal while the sorting is in ascending order for the top-down algorithms.
A reasonable value for the number ℓ bits dedicated to the label is 24.Here there are still 2 62−ℓ = 2 38 number of nodes available per level, which is equivalent to 6 TiB of data in itself.

Adiar: An Implementation
The algorithms and data structures described in Section 3 have been implemented in a new BDD package, named Adiar3 , that supports the operations in Table 1.The source code for Adiar is publicly available at github.com/ssoelvsten/adiar and the documentation is available at ssoelvsten.github.io/adiar/ .
Interaction with the BDD package is done through C++ programs that include the <adiar/adiar.h>header file and are built and linked with CMake.Its two dependencies are the Boost library and the TPIE library; the latter is included as a submodule of the Adiar repository, leaving it to CMake to build TPIE and link it to Adiar.

Adiar function
Table 1: Supported operations in Adiar together with their I/O-complexity.N is the number of nodes and L is the number of levels in a BDD.An assignment x is a list of tuples (i, v) : The BDD package is initialised by calling the adiar init(memory, temp dir) function, where memory is the memory (in bytes) dedicated to Adiar and temp dir is the directory where temporary files will be placed, which could be a dedicated harddisk.After use, the BDD package is deinitialised and its given memory is freed by calling the adiar deinit() function.
The bdd object in Adiar is a container for the underlying files to represent each BDD.A bdd object is used for possibly unreduced arc-based OBDD outputs.Both objects use reference counting on the underlying files to both reuse the same files and to immediately delete them when the reference count decrements 0. By use of implicit conversions between the bdd and bdd objects and an overloaded assignment operator, this garbage collection happens as early as possible, making the number of concurrent files on disk minimal.
A BDD can also be constructed explicitly with the node writer object in O(N/B) I/Os by supplying it all nodes in reverse of the ordering in (1).

Experimental Evaluation
We assert the viability of our techniques by investigating the following questions.(d) Use of our improved equality checking algorithms.
3. How well do our algorithms perform in comparison to conventional BDD libraries that use depth-first recursion, a unique table, and caching?

Benchmarks
To evaluate the performance of our proposed algorithms we have created implemented multiple benchmarks for Adiar and other BDD packages, where the BDDs are constructed in a similar manner and the same variable ordering is used.The source code for all benchmarks is publicly available at the following url: github.com/ssoelvsten/bdd-benchmark The raw data and data analysis has been made available at [41].
N-Queens.The solution to the N -Queens problem is the number of arrangements of N queens on an N × N board, such that no queen is threatened by another.Our benchmark follows the description in [29]: the variable x ij represents whether a queen is placed on the ith row and the jth column and so the solution corresponds to the number of satisfying assignments of the formula where has threat(i, j) is true, if a queen is placed on a tile (k, l), that would be in conflict with a queen placed on (i, j).The ROBDD of the innermost conjunction can be directly constructed, without any BDD operations. Tic-Tac-Toe.
In this problem from [29] one must compute the number of possible draws in a 4 × 4 × 4 game of Tic-Tac-Toe, where only N crosses are placed and all other spaces are filled with naughts.This amounts to counting the number of satisfying assignments of the following formula.
where init (N ) is true iff exactly N out of the 76 variables are true, and L is the set of all 76 lines through the 4 × 4 × 4 cube.To minimise the time and space to solve, lines in L are accumulated in increasing order of the number of levels they span.The ROBDDs for both init (N ) and the 76 line formulas can be directly constructed without any BDD operations.Hence, this benchmark always consists of 76 uses of Apply to accumulate the line constraints onto init (N ).

Picotrav.
The EPFL Combinational Benchmark Suite [2] consists of 23 combinatorial circuits designed for logic optimisation and synthesis.20 of these are split into the two categories random/control and arithmetic, and each of these original circuits C o are distributed together with one optimised for size C s and one for optimised for depth C d .The last three benchmarks are the More than a Million Gates benchmarks, which we will ignore as they come without optimised versions.They also are generated randomly and hence they are unrealistic anyway.
We have recreated a subset of the Nanotrav BDD application as distributed with CUDD [43].Here, we verify the functional equivalence between each output gate of the original circuit C o and the corresponding gate of an optimised circuit C d or C s .Every input gate is represented by a decision variable and recursively the BDD representing each gate is computed.Memoisation ensures the same gate is not computed twice, while a reference counter is maintained for each gate to clear the memoisation table; this allows for garbage collection of intermediate BDDs.Finally, every pair of BDDs that should represent the same output are tested for equality.
The variable order used was chosen based on what produced the smallest BDDs during our initial experiments.We had the random/control benchmarks use the order in which the inputs were declared while the arithmetic benchmarks derived an ordering based on the deepest reference within the optimised circuit to their respective input gate; ties for the same level are resolved by a DFS traversal of the same circuit.

Hardware
We have run experiments on the following two very different kinds of machines.
• Consumer grade laptop with one quadro-core 2.6 GHz Intel i7-4720HQ processor, 8 GiB of RAM, 230 GiB of available SSD disk, running Ubuntu, and compiling code with GCC 9.3.0.
• Grendel server node with two 48-core 3.0 GHz Intel Xeon Gold 6248R processors, 384 GiB of RAM, 3.5 TiB of available SSD disk, running CentOS Linux, and compiling code with GCC 10.1.0.
The former, due to its limited RAM, has until now been incapable of manipulating larger BDDs.The latter, on the other hand, has enough RAM and disk to allow all BDD implementations to solve large instances on comparable hardware.

Experimental Results
All but the largest benchmarks were run multiple times and the minimum measured running time is reported, since it minimises any error caused by slowdown and overhead on the CPU and the memory [16].Using the average or median value instead has only a negligible impact on the resulting numbers.

Research Question 1:
Fig. 11a shows the 15-Queens problem being solved with Adiar on the Grendel server node with variable amounts of available main memory.cgroups are used to enforce the machines memory limit, including its file system cache, to be only a single GiB more than what is given to Adiar.During these computations, the largest reduced BDD is 19.4 GiB which makes its unreduced input at least 25.9 GiB in size.That is, the input and output of the largest run of Reduce alone occupies at least 45.3 GiB.Yet, we only see a 39.1% performance decrease when decreasing the available memory from 256 GiB to 2 GiB.This change in performance primarily occurs in the 2 GiB to 64 GiB interval where data needs to be fetched from disk; more than half of this decrease is in the 2 GiB to 12 GiB interval.
To confirm the seamless transition to disk, we investigate different N , fix the memory, and normalise the data to be µs per node.The current version of Adiar is implemented purely using the external memory algorithms of TPIE.These perform poorly when given small amounts of data; the time it takes to initialise the larger memory makes it by several orders of magnitude slower than if it used equivalent internal memory algorithms.This overhead is apparent for N ≤ 11.
The consumer grade laptop's memory cannot contain the 19.4 GiB BDD and its SSD the total of 250.3 GiB of data generated by the 15-Queens problem.Yet, as Fig. 11b shows for N = 7, . . ., 15 the computation time per node only slows down by a factor of 1.8 when crossing the memory barrier from N = 14 to 15.Furthermore, this is only a slowdown at N = 15 by a factor of 2.02 compared to the lowest recorded computation time per node at N = 12.

Research Question 2a:
Table 2 shows the average running time of the N -Queens problem for N = 14, resp.the Tic-Tac-Toe problem for N = 22, when using the levelized priority queue compared to the priority queue of TPIE.
Priority Queue Time (s) TPIE 908 L-PQ(1) 678 L-PQ( 4) 677 Table 2: Performance increase by use of the levelized priority queue with k buckets (L-PQ(k)) compared to the priority queue of TPIE.
Performance increases by 25.3% for the Queens and by 37.0% for the Tic-Tac-Toe benchmark when switching from the TPIE priority queue to the levelized priority queue with a single bucket.The BDDs generated in either benchmark have very few (if any) arcs going across more than a single level, which explains the lack of any performance increase past the first bucket.

Research Question 2b:
To answer this question, we move the contents of F leaf into Q red at the start of the Reduce algorithm.This is not possible with the levelized priority queue, so the experiment is conducted on the consumer grade laptop with 7 GiB of ram dedicated to an older version of Adiar with the priority queue of TPIE.The node-to-leaf arcs make up 47.5% percent of all arcs generated in the 14-Queens benchmark.The average time to solve this goes down from 1258 to 919 seconds.This is an improvement of 27.0% which is 0.57% for every percentile the nodeto-leaf arcs contribute to the total number of arcs generated.Compensating for the performance increase in Research Question 2a this only amounts to 20.12%, i.e. 0.42% per percentile.

Research Question 2c:
Like in Research Question 2b, it is to be expected that this optimisation is dependant on the input.Table 3 shows different benchmarks run on the consumer grade laptop with and without pruning.
For N -Queens the largest unreduced BDD decreases in size by at least 23% while the median is unaffected.The x ij ∧ ¬has threat(i, j) base case consists mostly of ⊥ leaves, so they can prune the outermost conjunction of rows but not the disjunction within each row.The relative number of node-to-leaf arcs is also at least doubled, which decreases the number of elements placed in the priority queues.This, together with the decrease in the largest unreduced BDD, explains how the computation time decreases by 49% for N = 15.Considering our results for Research Question 2b above at most half of that speedup can be attributed to increasing the percentage of node-to-leaf arcs.
We observe the opposite for the Tic-Tac-Toe benchmark.The largest unreduced size is unaffected while the median size decreases by at least 20%.Here, the BDD for each line has only two arcs to the ⊥ leaf to shortcut the accumulated conjunction, while the largest unreduced BDDs are created when accumulating the very last few lines, that span almost all levels.Still, there is still a doubling in the total ratio of node-to-leaf arcs and we see at least an 11.6% decrease in the computation time.

Research Question 2d:
Table 4 shows the time to do equality checking on the three largest circuits verified with the Picotrav application run on the Grendel server nodes using Adiar with 300 GiB available.The number of nodes and size reported in the table is of a single set of BDDs that describe the final circuit.That is, since Adiar does not share nodes, then the set of final BDDs for the specification and the optimised circuit are distinct; the mem ctrl benchmark requires 1231 isomorphism checks on a total of 2 • 2.68 • 10 10 nodes which is equivalent to 1.17 TiB of data.In these benchmarks all equality checking was possible to do with a weighted average performance of 0.109 µs/node.Table 5 shows the running time of equality checking by instead computing whether f ↔ g = ⊤.Not even taking the speedup due to the priority queue and separation of node-to-leaf arcs in Section 5.3.2 and 5.3.3into account, this approach, as is necessary with Arge's original algorithms, is 2.42 -63.5 slower than using our improved algorithms.We have compared the performance of Adiar 1.0.1 with the BuDDy 2.4 [30], the CUDD 3.0.0[43], and the Sylvan 1.5.0 [19] BDD package.
To this end, we ran all our benchmarks on Grendel server nodes, which were set to use 350 GiB of the available RAM, while each BDD package is given 300 GiB of it.Sylvan was set to not use any parallelisation and given a ratio between the unique node table and the computation cache of 64:1.BuDDy was set to the same cache-ratio and the size of the CUDD cache was set such it would have an equivalent ratio when reaching its 300 GiB limit.The I/O analysis in Section 2.2.1 is evident in the running time of Sylvan's implicit Reduce, which increases linearly with the size of the node table 4 .Hence, Sylvan has been set to start its table 2 12 times smaller than the final 262 GiB it may occupy, i.e. at first with a table and cache that occupies 66 MiB.
As is evident in Section 5.3.1, the slowdown for Adiar for small instances, due to its use of external memory algorithms, makes it meaningless to compare its performance to other BDD packages when the largest BDD is smaller than 32 MiB.Hence, we have chosen to omit these instances from this report, though the full data set (including these instances) is publicly available.

N-Queens.
Fig. 12 shows for each BDD package their running time computing the N -Queens benchmark for 12 ≤ N ≤ 17.At N = 13, where Adiar's computation time per node is the lowest, Adiar runs by a factor of 5.1 slower than BuDDy, 2.3 than CUDD, and 2.6 than Sylvan.The gap in performance of Adiar and other packages gets smaller as instances grow larger: for N = 15, which is the largest instance solvable by CUDD and Sylvan, Adiar is only slower than CUDD, resp.Sylvan, by a factor of 1.47, resp.2.15. 4 Experiments using perf on Sylvan show that dereferencing a bucket in the unique node table and using x86 locks to obtain exclusive ownership of cache lines are one of the main reasons for the slowdown.We hypothesise that the overhead of mmap is the main culprit.Adiar outperforms all three libraries in terms of successfully computing very large instances.The largest BDD constructed by Adiar for N = 17 is 719 GiB in size, whereas Sylvan with N = 15 only constructs BDDs up to 12.9 GiB in size.Yet, at N = 17, where Adiar has to make heavy use of the disk, Adiar's computation time per node only slows down by a factor of 1.8 compared to its performance at N = 13.
Conversely, Adiar is also able to solve smaller instances with much less memory than other packages.Fig 13 shows the running time for both Adiar, CUDD, and Sylvan solving the 15-Queens problem depending on the available memory.Sylvan was first able to solve this problem when given 56 GiB of memory while CUDD, presumably due to its larger node-size and multiple data structures, requires 72 GiB of memory to be able to compute the solution. Tic-Tac-Toe.
The running times we obtained for this benchmark, as shown in Fig. 14, paint the same picture as for Queens above: the factor with which Adiar runs slower than the other packages decreases as the size of the instance increases.At N = 24, which is the largest instance solved by CUDD, Adiar runs slower than CUDD by a factor of 2.38.The largest instance solved by Sylvan is N = 25 where the largest BDD created by Sylvan is 34.4 GiB in size, and one incurs a 2.50 factor slowdown by using Adiar instead.
Adiar was able to solve the instance of N = 29, where the largest BDD created was 902 GiB in size.Yet, even though the disk was extensively used, Adiar's computation time per node only slows down by a factor of 2.49 compared to its performance at N = 22.

Picotrav.
Table 6 shows the number of successfully verified circuits by each BDD package and the number of benchmarks that were unsuccessful due to a full node table or a full disk and the ones that timed out after 15 days of computation time.
If Sylvan's unique node table and computation cache are immediately instantiated to their full size of 262 GiB then it is able to verify 3 more of the 40 circuits within the 15 days time limit.One of these three is the arbiter benchmark optimised with respect to size that BuDDy and CUDD are able to solve Table 6: Number of verified arithmetic and random/control circuits from [2] in a few seconds.Yet, BuDDy also exhibits a similar slowdown when it has to double its unique node table to its full size.We hypothesise this slowdown is due to the computation cache being cleared when nodes are rehashed into the doubled node table, while this benchmark consists of a lot of repeated computations.The other are the mem ctrl benchmark optimised with respect to size and depth.The performance of Adiar compared to the other packages is reminiscent to our results from the two other benchmarks.For example, the voter benchmark, where the largest BDD for Adiar is 8.2 GiB in size, it is 3.69 times slower than CUDD and 3.07 times slower than Sylvan.In the mem ctrl benchmark optimised for depth which Sylvan barely was able to solve by changing its cache ratio, Adiar is able to construct the BDDs necessary for the comparison 1.01 times faster than Sylvan.This is presumably due to the large overhead of Sylvan in having to repeatedly run its garbage collection algorithms.
Despite the fact that the disk available to Adiar is only 12 times larger than internal memory, Adiar has to explicitly store both the unreduced and reduced BDDs, and many of the benchmarks have hundreds of BDDs concurrently alive, Adiar is still able to solve the same benchmarks as the other packages.For example, the largest solved benchmark, mem ctrl, has at one point 1231 different BDDs in use at the same time.

Conclusions and Future Work
We propose I/O-efficient BDD manipulation algorithms in order to scale BDDs beyond main memory.These new iterative BDD algorithms exploit a total topological sorting of the BDD nodes, by using priority queues and sorting algorithms.All recursion requests for a single node are processed at the same time, which fully circumvents the need for a memoisation table.If the underlying data structures and algorithms are cache-aware or cache-oblivious, then so are our algorithms by extension.The I/O complexity of our time-forward processing algorithms is compared to the conventional recursive algorithms on a unique node table with complement edges [9] in Table 7.
The performance of Adiar is very promising in practice for instances larger than a few hundred MiB.As the size of the BDDs increase, the performance of Adiar gets closer to conventional recursive BDD implementations.When the BDDs involved in the computation exceeds a few GiB then the use of Adiar only results in a 3.69 factor slowdown compared to Sylvan and CUDD -it was only 1.47 times slower than CUDD in the largest Queens benchmark that CUDD could solve.Simultaneously, the design of our algorithms allow us to compute on BDDs that outgrow main memory with only a 2.49 factor slowdown, which is negligible in comparison to the slowdown that conventional BDD packages experience when using swap memory.
This performance comes at the cost of not sharing any nodes and so not being able to compare for functional equivalence in O(1) time.We have improved the asymptotic behaviour of equality checking to only be an O(sort(N )) algorithm which in practice is negligible compared to the time to construct the BDDs involved.For 71.6% of all the output gates we verified from EPFL Combinational Benchmark Suite [2] we were even able to do so with a simple O(N/B) linear scan that can compare more than 1.86 GiB/s.This number is realistic, since modern SSDs, depending on block size used, have a sequential transfer rate of 1 GiB/s and 2.8 GiB/s.
In practice, the fact that nodes are not shared does not negatively impact the ability of Adiar to solve a problem in comparison to conventional BDD packages.This is despite the ratio between disk and main memory is smaller than the number of BDDs in use.Furthermore, garbage collection becomes a trivial and cheap deletion of files on disk.This lays the foundation on which we intend to develop external memory BDD algorithms usable in the fixpoint algorithms for symbolic model checking.We will, to this end, attempt to improve further on the non-constant equality checking, improve the performance of quantifying multiple variables, and design an I/O-efficient relational product function.Furthermore, we intend to improve Adiar's performance for smaller instances by processing them exclusively in internal memory and generalise its implementation to also include Multi-Terminal [21] and Zero-suppressed [33]

Figure 1 :
Figure 1: Examples of Reduced Ordered Binary Decision Diagrams.Leaves are drawn as boxes with the boolean value and internal nodes as circles with the decision variable.Low edges are drawn dashed while high edges are solid.

Figure 3 :
Figure 3: In-order representation of BDDs of Fig. 1

Figure 4 :
Figure 4: The Apply-Reduce pipeline of our proposed algorithms Semi-transposed graph.(pairs indicate nodes in Fig.1a and 1b, respectively) (b) In-order arc-based representation.

Figure 6 :
Figure 6: The RequestsFor subroutine for Apply t a b e l ← u n d e f i n e d / * I n s e r t r e q u e s t f o r r o o t (v f , vg) * / Qapp:1 .push ( NIL undefined − −−−− → (v f .uid, vg.uid ) ) / * P r o c e s s r e q u e s t s i n t o p o l o g i c a l o r d e r * / while Qapp:1 high ) od e l s e / * P r o c e s s r e q u e s t (t f , tg) * / i d ← i f l a b e l = t seek .label then 0 e l s e i d +1 l a b e l ← t seek .label / * Forward or o u t p u t out −g o i n g a r c s * / r low , r high ← RequestsFor ( (

Figure 7 :
Figure 7: The Apply algorithm

Figure 8 :
Figure 8: The ShortcuttedRequestsFor subroutine for Apply ∈ Fj:2 by uid i n d e s c e n d i n g o r d e r f o r each [ uid → uid ′ ] ∈ MergeMaxUid( Fj:1 , Fj:2 ) do while a r c s from F internal .peek ( ) matches − → uid do ( s is high

Figure 9 :
Figure 9: The Reduce algorithm

The difference c 1
•sort(N ℓ /B)−N ℓ /B is the least number of saved I/Os, whereas c 2 •sort(N ℓ /B)−N ℓ /B is the maximal number of saved I/Os.The O(sort(N ℓ /B)) bound follows easily as shown below for N ℓ > n 0 .

Proposition 3 . 3 (
Following Arge 1996 [6]).The Reduce algorithm in Fig.9has an O(sort(N )) I/O complexity and an O(N log N ) time complexity.Proof.Up to 2N arcs are inserted in and later again extracted from Q red , while F internal and F leaf is scanned once.This totals an O(sort(4N ) + N/B) = O(sort(N )) number of I/Os spent on the priority queue.On each level all nodes are sorted twice, which when all levels are combined amounts to another O(sort(N )) I/Os.One arrives with similar argumentation at the O(N log N ) time complexity.

Proposition 3 . 5 .
Negation has I/O, space, and time complexity O(1).Proof.Even though O(N ) extra work has to be spent on negating every nodes inside of all other procedures, this extra work can be accounted for in the big-O of the other operations.

Proposition 3 . 6 .
If the BDDs for f and g satisfy the constraints in (2), then the above equality checking procedure has I/O complexity O(sort(N )) and time complexity O(N • log N ), where N is the size of both BDDs.

Proposition 3 . 9 .
Counting the number of paths and satisfying assignments in the BDD for f has I/O complexity O(sort(N )) and time complexity O(N log N ), where N is the size of the BDD for f .Proof.Every node of f is processed in-order and all 2N arcs are inserted and extracted once from the priority queue.This totals O(sort(N )) + N/B I/Os.

Lemma 3 . 10 .
Evaluating f for some assignment x has I/O complexity N/B + x/B and time complexity O(N + | x|), where N is the size of f 's BDD.

Lemma 3 . 11 .
Obtaining the lexicographically smallest (or largest) assignmentx such that f ( x) = ⊤ has I/O complexity N/B + x/B and time complexity O(N ), where N is the size of the BDD for f .

Proposition 3 . 13 .
Computing the Restrict of f for a single variable x i has I/O complexity O(sort(N )) and time complexity O(N log N ), where N , is the size of the BDD for f .Proof.The priority queue requires O(sort(2N )) I/Os since every arc in the priority queue is related one-to-one with one of the 2N arcs of the input.The output consists of at most 2N arcs, which takes 2 • N/B I/Os to write and at most O(sort(2N )) to sort.All in all, the algorithm uses N/B + O(sort(2N )) + 2 • N/B + O(sort(2N )) I/Os.

Corollary 3 . 14 .
Computing the Restrict of f for multiple variables x has I/O complexity O(sort(N ) + | x|/B) and time complexity O(N log N + | x|), where N , is the size of the BDD for f .Proof.The only extra work added is reading x in-order and to decide for all L ≤ N levels whether to keep the level's nodes or bridge over them.This constitutes another | x|/B I/Os and O(N ) time.
Unique identifier of a leaf v 0 label identifier f (b) Unique identifier of an internal node

Figure 10 :
Figure 10: Bit-representations.The least significant bit is right-most.

1 .
How well does our technique perform on BDDs larger than main memory?2. How big an impact do the optimisations we propose have on the computation time and memory usage?(a) Use of the levelized priority queue.(b) Separating the node-to-leaf arcs of Reduce from the priority queue.(c) Pruning the output of Apply by shortcutting the given operator.
Computing 15-Queens with different amount of memory available.
node (b) Computing N -Queens solutions with 7 GiB of memory available.

Figure 11 :
Figure 11: Performance of Adiar in relation to available memory.

Table 4 :
Running time for equivalence testing.O(sort(N )) and O(N/B) is the number of times the respective algorithm in Section 3.3.3was used.The voter benchmark is especially interesting, since it consists only of a single output gate and the O(sort(N )) and O(N/B) algorithm are used respectively in the depth and size optimised instance.As witnessed in Section 5.3.1,Adiar has a bad performance for smaller instances.Yet, despite only a total of 11.48 MiB of data is compared, the O(sort(N )) algorithm runs at 0.116 µs/node and the O(N/B) scan at only 0.012 µs/node.That is, the O(N/B) algorithm can compare at least 2 • 5.75 MiB/0.006s = 1.86 GiB/s.

Figure 13 :
Figure 13: Running time of 15-Queens with variable memory (lower is better).

Figure 14 :
Figure 14: Running time solving Tic-Tac-Toe (lower is better).# solved # out-of-space # time-out Adiar 23 6 11 BuDDy 19 20 1 CUDD 20 19 1 Sylvan 20 13 7 PopMax( Q red , F leaf ) e low ← PopMax( Q red , F leaf ) i f e high .tar g e t = e low .t a r g e t then Fj:1 .push ( [ e low .s o u r c e → e low .t a r g e t ] ) e l s e Fj .push ( ( e low .s o u r c e , e low .t a r g e t , e high .t a r g e t ) ) od s o r t v ∈ Fj by v .low and s e c o n d l y by top ( ) .s o u r c e .l a b e l ; i d ← MAX ID ; Fj ← [ ] ; Fj:1 ← [ ] ; Fj:2 ← [ ] while Q red .top ( ) .s o u r c e .l a b e l = j do e high ← There are at most N 2 pairs to process which results in at most 2 • N 2 number of arcs in the output and in the priority queues.The two complexity bounds follows by similar analysis as in the proof of Proposition 3.1 and 3.13.
Proposition 3.15.Computing the Exists and Forall of f for a single variable x i has I/O complexity O(sort(N 2 )) and time complexity O(N 2 log N 2 ), where N is the size of the BDD for f .Proof.A request for a node t is equivalent to a request to (t, t).

Table 3 :
Effect of pruning on performance.The first row for each feature is without pruning and the second is with pruning.

Table 5 :
Running time for checking equality with f ↔ g = ⊤ Out of the 3861 output gates checked for equality throughout the 23 verified circuits the O(N/B) linear scan could be used for 71.6% of them.
Decision Diagrams.
f , N g

Table 7 :
I/O-complexity of conventional depth-first algorithms compared to the time-forwarded we propose.Here, N/B < sort(N ) N/B •log M/B (N/B) ≪ N , where N is the number of nodes, and L the number of levels in the BDD.