Sylvan: multicore framework for decision diagrams
Abstract
Decision diagrams, such as binary decision diagrams, multi-terminal binary decision diagrams and multi-valued decision diagrams, play an important role in various fields. They are especially useful to represent the characteristic function of sets of states and transitions in symbolic model checking. Most implementations of decision diagrams do not parallelize the decision diagram operations. As performance gains in the current era now mostly come from parallel processing, an ongoing challenge is to develop data structures and algorithms for modern multicore architectures. The decision diagram package Sylvan provides a contribution by implementing parallelized decision diagram operations, thus allowing sequential algorithms that use decision diagrams to exploit the power of multicore machines. This paper discusses the design and implementation of Sylvan, especially an improvement to the lock-free unique table that uses bit arrays, the concurrent operation cache and the implementation of parallel garbage collection. We extend Sylvan with multi-terminal binary decision diagrams for integers, real numbers and rational numbers. This extension also allows for custom MTBDD leaves and operations and we provide an example implementation of GMP rational numbers. Furthermore, we show how the provided framework can be integrated in existing tools to provide out-of-the-box parallel BDD algorithms, as well as support for the parallelization of higher-level algorithms. As a case study, we parallelize on-the-fly symbolic reachability in the model checking toolset LTSmin. We experimentally demonstrate that the parallelization of symbolic model checking for explicit-state modeling languages, as supported by LTSmin, scales well. We also show that improvements in the design of the unique table result in faster execution of on-the-fly symbolic reachability.
Keywords
Multicore · Parallel · Binary decision diagrams · Multi-terminal binary decision diagrams · Multi-valued decision diagrams · Symbolic model checking
1 Introduction
In model checking, we create models of complex systems to verify that they function according to certain properties. Systems are modeled using possible states and transitions between these states. An important part of many model checking algorithms is state-space exploration using a reachability algorithm, to compute all states reachable from some initial state. A major challenge is that the space and time requirements of these algorithms increase exponentially with the size of the models. One method to alleviate this problem is symbolic model checking [12], where states are not treated individually but as sets of states, stored in binary decision diagrams (BDDs). For many symbolic model checking algorithms, most time is spent in the BDD operations. Another method uses parallel computation, e.g., in computer systems with multiple processors. In [21, 23, 26], we combined both approaches by parallelizing BDD operations in the parallel BDD library Sylvan.
Contributions This paper is an extended version of [26]. We refer also to the PhD thesis of the first author [21] for a more extensive treatment of multicore decision diagrams.
In [26], we presented an extension to Sylvan that implements operations on list decision diagrams (LDDs). We also investigated applying parallelism on a higher level than the BDD/LDD operations. Since computing the full transition relation is expensive, the model checking toolset LTSmin [7, 24, 38, 42] has the notion of transition groups, which disjunctively partition the transition relation. We exploited the fact that partitioned transition relations can be applied in parallel and showed that this results in improved scalability. In addition, LTSmin supports learning transition relations on-the-fly, which enables the symbolic model checking of explicit-state models. We implemented a specialized operation collect, which combines enumerate and union, to perform parallel transition learning and we showed that this results in good parallel performance.
Since [26], we equipped Sylvan with a versatile implementation of MTBDDs, allowing symbolic computations on integers, floating-point numbers, rational numbers and other types. We discuss the design and implementation of our MTBDD extension, as well as an example of custom MTBDD leaves with the GMP library. Furthermore, we redesigned the unique table to require fewer cas operations per created node. We also describe the operation cache and parallel garbage collection in Sylvan.
Experiments on the BEEM database of explicit-state models show that parallelizing the higher level algorithms in LTSmin pays off, as the parallel speedup increases from 5.6\(\times \) to 16\(\times \), while the sequential computation time (with 1 worker) stays the same. The experiments also show that LDDs perform better than BDDs for this set of benchmarks. In addition to the experiment performed in [26], we include additional experiments using the new hash table. These benchmark results show that the new hash table results in a 21 % faster execution for 1 worker, and a 30 % faster execution with 48 workers, improving the parallel speedup from 16\(\times \) to 18\(\times \).
Outline
This paper is organized as follows. After a review of the related work in Sect. 2, we introduce decision diagrams and parallel programming in Sect. 3. Section 4 discusses how we use work-stealing to parallelize operations. Section 5 presents the implementation of the data structures of the unique table and the operation cache, as well as the implementation of parallel garbage collection in Sylvan. Section 6 discusses the implementation of specific decision diagram operations, especially the BDD and MTBDD operations. In Sect. 7, we apply parallelization to on-the-fly symbolic reachability in LTSmin. Section 8 shows the results of several experiments using the BEEM database of explicit-state models to measure the effectiveness of our approach. Finally, Sect. 9 summarizes our findings and reflections.
2 Related work
This section is largely based on earlier literature reviews we presented in [23, 26].
Massively parallel computing (early ’90s) In the early ’90s, researchers tried to speed up BDD manipulation by parallel processing. The first paper [39] views BDDs as automata, and combines them by computing a product automaton followed by minimization. Parallelism arises by handling independent subformulae in parallel: the expansion and reduction algorithms themselves are not parallelized. They use locks to protect the global hash table, but nevertheless obtain a speedup that is almost linear in the number of processors. Most other work in this era implemented BFS algorithms for vector machines [47] or massively parallel SIMD machines [13, 32] with up to 64K processors. Experiments were run on supercomputers, like the Connection Machine. Given the large number of processors, the speedup (around 10–20) was disappointing.
Parallel operations and constructions An interesting contribution in this period is the paper by Kimura et al. [40]. Although they focus on the construction of BDDs, their approach relies on the observation that suboperations from a logic operation can be executed in parallel and the results can be merged to obtain the result of the original operation. Our solution to parallelizing BDD operations follows the same line of thought, although the work-stealing method for efficient load balancing that we use was first published 2 years later [8]. Similar to [40], Parasuram et al. implement parallel BDD operations for distributed systems, using a “distributed stack” for load balancing, with speedups from 20–32 on a CM-5 machine [50]. Chen and Banerjee implemented the parallel construction of BDDs for logic circuits using lock-based distributed hash tables, parallelizing on the structure of the circuits [14]. Yang and O’Hallaron [60] parallelized breadth-first BDD construction on multiprocessor systems, resulting in reasonable speedups of up to 4\(\times \) with 8 processors, although there is a significant synchronization cost due to their lock-protected unique table.
Distributed memory solutions (late ’90s) Attention shifted towards Networks of Workstations, based on message passing libraries. The motivation was to combine the collective memory of computers connected via a fast network. Both depth-first [3, 5, 57] and breadth-first [53] traversal have been proposed. In the latter, BDDs are distributed according to variable levels. A worker can only proceed when its level has a turn, so these algorithms are inherently sequential. The advantage of distributed memory is not that multiple machines can perform operations faster than a single machine, but that their memory can be combined to handle larger BDDs. For example, even though [57] reports a nice parallel speedup, the performance with 32 machines is still 2\(\times \) slower than the non-parallel version. BDDNOW [46] is the first BDD package that reports some speedup compared to the non-parallel version, but it is still very limited.
Parallel symbolic reachability (after 2000) After 2000, research attention shifted from parallel implementations of BDD operations towards the use of BDDs for symbolic reachability in distributed [15, 33] or shared memory [18, 28]. Here, BDD partitioning strategies such as horizontal slicing [15] and vertical slicing [35] were used to distribute the BDDs over the different computers. Also, the saturation algorithm [16], an optimal iteration strategy in symbolic reachability, was parallelized using horizontal slicing [15] and using the work-stealer Cilk [28], although it is still difficult to obtain good parallel speedup [18].
Multicore BDD algorithms There is some recent research on multicore BDD algorithms. There are several implementations that are thread-safe, i.e., they allow multiple threads to use BDD operations in parallel, but they do not offer parallelized operations. In a thesis on the BDD library JINC [49], Chapter 6 describes a multi-threaded extension. JINC’s parallelism relies on concurrent tables and delayed evaluation. It does not parallelize the basic BDD operations, although this is mentioned as possible future research. Also, a recent BDD implementation in Java called BeeDeeDee [43] allows execution of BDD operations from multiple threads, but does not parallelize single BDD operations. Similarly, the well-known sequential BDD implementation CUDD [56] supports multi-threaded applications, but only if each thread uses a different “manager”, i.e., unique table to store the nodes in. Except for our contributions [23, 24, 26] related to Sylvan, there is no recent published research on modern multicore shared-memory architectures that parallelizes the actual operations on BDDs. Recently, Oortwijn [48] continued our work by parallelizing BDD operations on shared memory abstractions of distributed systems using remote direct memory access. Also, Velev and Gao [58] have implemented parallel BDD operations on a GPU using a parallel cuckoo hash table.
Finally, we refer to Somenzi [55] for a detailed paper on the implementation of decision diagrams, and to the PhD thesis of the first author [21] on multicore decision diagrams.
3 Preliminaries
This section presents the definitions of binary decision diagrams (BDDs), multi-valued decision diagrams (MDDs), multi-terminal binary decision diagrams (MTBDDs) and list decision diagrams (LDDs) from the literature [4, 6, 11, 37]. Furthermore, we discuss parallel programming.
3.1 Decision diagrams
Binary decision diagrams (BDDs) are a concise and canonical representation of Boolean functions \(\mathbb {B}^N\rightarrow \mathbb {B}\) [2, 11]. They are a basic structure in discrete mathematics and computer science. A (reduced, ordered) BDD is a rooted directed acyclic graph with leaves 0 and 1. Each internal node has a variable label \(x_i\) and two outgoing edges labeled 0 and 1, called the “low” and the “high” edge. Furthermore, variables are encountered along each directed path according to a fixed variable ordering. Duplicate nodes (two nodes with the same variable label and outgoing edges) and nodes with two identical outgoing edges (redundant nodes) are forbidden. It is well known that, given a fixed variable ordering, every Boolean function is represented by a unique BDD [11].
As an alternative to MDDs, list decision diagrams (LDDs) represent sets of integer vectors, such as sets of states in model checking. List decision diagrams encode functions \((\mathbb {N}_{< v})^N\rightarrow \mathbb {B}\), and were initially described in [6, Sect. 5]. A list decision diagram is a rooted directed acyclic graph with leaves 0 and 1. Each internal node has a value v and two outgoing edges labeled > and \(=\), also called the “right” and the “down” edge. Along the “right” edges, values v are encountered in ascending order. The “down” edge never points to leaf 0 and the “right” edge never points to leaf 1. Duplicate nodes are forbidden. See Fig. 4 for an example of an LDD that represents the same set as the MDD in Fig. 3.
LDD nodes have a property called a level (and its dual, depth), which is defined as follows: the root node is at the first level, nodes along “right” edges stay in the same level, while “down” edges lead to the next level. The depth of an LDD node is the number of “down” edges to leaf 1.
LDDs compared to MDDs
A typical method to store MDDs in memory stores the variable label \(x_i\) plus an array holding all \(n_i\) edges (pointers to nodes), e.g., in [45]: struct node { int lvl; node* edges[]; }. New nodes are allocated dynamically using malloc and a hash table ensures that no duplicate MDD nodes are created. Alternatively, one could use a large int[] array to store all MDDs (each MDD is represented by \(n_i+1\) consecutive integers) and represent edges to an MDD as the index of the first integer. In [17], the edges are stored in a separate int[] array to allow the number of edges \(n_i\) to vary. Implementations of MDDs that use arrays to implement MDD nodes have two disadvantages. (1) For sparse sets (where only a fraction of the possible values are used, and outgoing edges to 0 are not stored) using arrays is a waste of memory. (2) MDD nodes typically have a variable size, complicating memory management.
List decision diagrams can be understood as a linked-list representation of “quasi-reduced” MDDs. Quasi-reduced MDDs are a variation of normal (fully-reduced) MDDs. Instead of forbidding redundant nodes (with identical outgoing edges), quasi-reduced MDDs forbid skipping levels. They are canonical representations, like fully-reduced MDDs. An advantage of quasi-reduced MDDs is that, for some applications, edges that do not skip levels can be easier to manage [17]. Also, variable labels do not need to be stored, as they follow implicitly from the depth of the MDD.
LDDs have several advantages compared to MDDs [6]. LDD nodes are binary, so they have a fixed node size which is easier for memory allocation. They are better for sparse sets: valuations that lead to 0 simply do not appear in the LDD. LDDs also have more opportunities for the sharing of nodes, as demonstrated in the example of Fig. 4, where the LDD encoding the set \(\{2,4\}\) is used for the set \(\{0,2,4\}\) and reused for the set \(\{\left\langle 3,2 \right\rangle ,\left\langle 3,4 \right\rangle \}\), and similarly, the LDD encoding \(\{1\}\) is used for \(\{0,1\}\) and for \(\{\left\langle 6,1 \right\rangle \}\). A disadvantage of LDDs is that their linked-list style introduces edges “inside” the MDD nodes, requiring more memory pointers, similar to linked lists compared with arrays.
3.2 Decision diagram operations
Operations on decision diagrams are typically recursively defined. Suboperations are computed based on the subgraphs of the inputs, i.e., the decision diagrams obtained by following the two outgoing edges of the root node, and their results are used to compute the result of the full operation. In this subsection we look at Algorithm 1, a generic example of a BDD operation. This algorithm takes as inputs the BDDs x and y (with the same fixed variable ordering), to which a binary operation \(\textsf {F}\) is applied. We assume that, given the same parameters, F always returns the same result. Therefore, we use a cache to store the results of (sub)operations. This is in fact required to reduce the complexity class of many BDD operations from exponential time to polynomial time.
Most decision diagram operations first check if the operation can be applied immediately to x and y (line 2). This is typically the case when x and y are leaves. Often there are also other trivial cases that can be checked first. After this, the operation cache is consulted (lines 3–4). In cases where computing the result for leaves or other cases takes a significant amount of time, the cache should be consulted first. Often, the parameters can be normalized in some way to increase the cache efficiency. For example, \(a\wedge b\) and \(b\wedge a\) are the same operation. In that case, normalization rules can rewrite the parameters to some standard form to increase cache utilization, at line 3. A well-known example is the if-then-else algorithm, which rewrites using rewrite rules called “standard triples” as described in [10].
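The consult-then-store pattern described above can be sketched as follows in C. This is a minimal illustration, not Sylvan's actual cache: the names (`cache_get`, `cache_put`), the entry layout and the mixing function are all assumptions, and a real implementation would pack entries more tightly and handle concurrent access (see Sect. 5).

```c
#include <stdint.h>

/* Hypothetical fixed-size, lossy operation cache (sketch): colliding
 * entries simply overwrite each other, which is sound because cached
 * results can always be recomputed. */
#define CACHE_SIZE 4096  /* must be a power of two */

typedef struct { uint64_t op, x, y, result; int valid; } cache_entry;
static cache_entry cache[CACHE_SIZE];

static uint64_t cache_slot(uint64_t op, uint64_t x, uint64_t y)
{
    /* a simple mixing function; a real cache would use a stronger hash */
    uint64_t h = op * 0x9E3779B97F4A7C15ULL ^ x * 0xC2B2AE3D27D4EB4FULL ^ y;
    return h & (CACHE_SIZE - 1);
}

/* returns 1 and fills *result on a hit, 0 on a miss */
int cache_get(uint64_t op, uint64_t x, uint64_t y, uint64_t *result)
{
    cache_entry *e = &cache[cache_slot(op, x, y)];
    if (e->valid && e->op == op && e->x == x && e->y == y) {
        *result = e->result;
        return 1;
    }
    return 0;
}

/* overwrites whatever entry currently occupies the slot (lossy cache) */
void cache_put(uint64_t op, uint64_t x, uint64_t y, uint64_t result)
{
    cache_entry *e = &cache[cache_slot(op, x, y)];
    e->op = op; e->x = x; e->y = y; e->result = result; e->valid = 1;
}
```

Note that normalization, e.g. sorting the operands of a commutative operation so that \(a\wedge b\) and \(b\wedge a\) share an entry, would happen before `cache_get` is called.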
The operation lookupBDDnode is given in Algorithm 2. This operation ensures that there are no redundant nodes (line 2) and no complement mark on the low edge (lines 3–4) and employs the method findorinsert (implemented by the unique table, see Sect. 5) to ensure that there are no duplicate nodes (lines 6 and 9). If the hash table is full, then garbage collection is performed (line 8).
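The normalization steps of lookupBDDnode can be sketched as below. The function and macro names are illustrative, and `find_or_insert` is replaced here by a trivial mock that hands out fresh indices; the real unique table (Sect. 5) deduplicates nodes and may trigger garbage collection.

```c
#include <stdint.h>

/* The complement mark lives in the most significant bit of a 64-bit edge. */
#define COMPLEMENT 0x8000000000000000ULL

/* Mock unique table for illustration only: returns a fresh index per call.
 * The real findorinsert deduplicates and may raise TableFull. */
static uint64_t next_index = 1;
static uint64_t find_or_insert(uint32_t var, uint64_t low, uint64_t high)
{
    (void)var; (void)low; (void)high;
    return next_index++;
}

/* Sketch of lookupBDDnode: eliminate redundant nodes and normalize the
 * complement mark so it never appears on the low edge. */
uint64_t lookup_bdd_node(uint32_t var, uint64_t low, uint64_t high)
{
    if (low == high) return low;    /* redundant node: skip it */
    uint64_t mark = 0;
    if (low & COMPLEMENT) {
        low ^= COMPLEMENT;          /* complement both outgoing edges ... */
        high ^= COMPLEMENT;
        mark = COMPLEMENT;          /* ... and the edge to the node itself */
    }
    return find_or_insert(var, low, high) | mark;
}
```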
3.3 Parallel programming
In parallel programs, memory accesses can result in race conditions or data corruption, for example when multiple threads write to the same memory location. Often data structures are protected against race conditions using locking techniques. While locks are relatively easy to implement and reason about, they can severely cripple parallel performance, especially as the number of threads increases. Threads must wait until the lock is released, and locks can be a bottleneck when many threads try to acquire the same lock. Also, locks can sometimes cause spurious delays that smarter data structures could avoid, for example by recognizing that some operations do not interfere even though they access the same resource.
A key atomic primitive in this context is the compare-and-swap (cas) instruction. This operation atomically compares the contents of a given location in shared memory to some given expected value and, if the contents match, changes the contents to a given new value. If multiple processors try to change the same bytes in memory using cas at the same time, then only one succeeds.
Data structures that avoid locks are called non-blocking or lock-free. Such data structures often use the atomic cas operation to make progress in an algorithm, rather than protecting a part that makes progress. For example, when modifying a shared variable, an approach using locks would first acquire the lock, then modify the variable, and finally release the lock. A lock-free approach would use atomic cas to modify the variable directly. This requires only one memory write rather than three, but lock-free approaches are typically more complicated to reason about, and prone to bugs that are more difficult to reproduce and debug.
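The lock-free update pattern just described can be illustrated with C11 atomics. This is a minimal sketch: read the current value, compute the new value, and retry the cas until our update wins.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t shared = 0;

/* Lock-free addition to a shared variable: no lock is taken; instead we
 * retry until our cas succeeds. */
void lockfree_add(uint64_t delta)
{
    uint64_t expected = atomic_load(&shared);
    /* if another thread changed the value in the meantime, the cas fails,
     * `expected` is refreshed with the current value, and we retry */
    while (!atomic_compare_exchange_weak(&shared, &expected,
                                         expected + delta))
        ;
}
```

With a lock, the same update would cost at least three memory writes (acquire, modify, release); here a single successful cas suffices.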

Three levels of progress guarantees are commonly distinguished:

- In blocking data structures, it may be possible that no thread makes progress while some thread is suspended. If an operation may be delayed forever because another thread is suspended, then that operation is blocking.
- In lock-free data structures, if any thread working on the data structure is suspended, then other threads must still be able to perform their operations. An operation may be delayed forever, but only because other threads are making progress, never because another thread is suspended.
- In wait-free data structures, every thread can complete its operation within a bounded number of steps, regardless of the other threads; all threads make progress.
3.4 System architecture
This paper assumes a cache coherent shared memory NUMA architecture, i.e., there are multiple processors and multiple memories, with a hierarchy of caches, all connected via interconnect channels. The shared memory is divided into regions called cachelines, which are typically 64 bytes long. Only whole cachelines are communicated between processors and with the memory. Data structures designed for multicore shared-memory architectures should aim to minimize the number of cacheline transfers to be efficient. We also assume the x86 TSO memory model [54]. In this memory model, memory writes of each processor are not reordered, but memory writes can be buffered. The data structures presented in this paper rely on compare-and-swap instructions and assume total store ordering for their correctness.
4 Parallel operations using work-stealing
This section describes how we use work-stealing to execute operations on decision diagrams in parallel.
We implement recursively defined operations such as Algorithm 1 as independent tasks using a task-based parallel framework. For task parallelism that fits a “strict” fork-join model, i.e., each task creates the subtasks that it depends on, work-stealing is well known to be an effective load balancing method [8], with implementations such as Cilk [9, 31] and Wool [29, 30] that allow writing parallel programs in a style similar to sequential programs [1]. Work-stealing has been proven to be optimal for a large class of problems and has tight memory and communication bounds [8].
We use do in parallel to denote that tasks are executed in parallel. Programs in the Cilk/Wool style are then implemented as in Fig. 5. The SPAWN keyword creates a new task. The SYNC keyword matches the last unmatched SPAWN, i.e., spawned tasks are treated as if stored on a stack. It waits until that task is completed and retrieves the result. Every SPAWN during the execution of the program must have a matching SYNC. The CALL keyword skips the task stack and immediately executes a task.
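The SPAWN/CALL/SYNC pattern can be mimicked with raw threads, as in the deliberately naive sketch below: SPAWN becomes a thread creation, CALL a plain function call, and SYNC a join. This is for illustration only; spawning an operating-system thread per task is far too expensive in practice, which is exactly why work-stealing frameworks such as Lace keep spawned tasks on a cheap per-worker stack instead.

```c
#include <pthread.h>
#include <stdint.h>

static void *fib_task(void *arg);

/* Fork-join Fibonacci: SPAWN fib(n-1), CALL fib(n-2), then SYNC. */
static uint64_t fib(uint64_t n)
{
    if (n < 2) return n;
    uint64_t a = n - 1;
    pthread_t t;
    pthread_create(&t, NULL, fib_task, &a);   /* SPAWN fib(n-1) */
    uint64_t b = fib(n - 2);                  /* CALL fib(n-2) */
    void *res;
    pthread_join(t, &res);                    /* SYNC: wait and get result */
    return (uint64_t)(uintptr_t)res + b;
}

static void *fib_task(void *arg)
{
    return (void *)(uintptr_t)fib(*(uint64_t *)arg);
}
```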
We replaced the work-stealing framework Wool [29], which we used in the first published version of Sylvan [23], with Lace [25], which we developed based on ideas to minimize interactions between workers and with the shared memory. Lace is built around a novel work-stealing queue, which is described in detail in [25]. Lace also implements extra features necessary for parallel garbage collection.
To implement tasks, Lace provides C macros that require only a few modifications of the original source code. One Lace feature that is helpful for garbage collection in Sylvan suspends all current tasks and starts a new task tree. Lace implements a macro NEWFRAME(...) that starts a new task tree, where one worker executes the given task and all other workers perform work-stealing to help execute this task in parallel. The macro TOGETHER(...) also starts a new task tree, but all workers execute a local copy of the given task.
Sylvan uses the NEWFRAME macro as part of garbage collection, and the TOGETHER macro to perform threadspecific initialization. Programs that use Sylvan can also use the Lace framework to parallelize their highlevel algorithms. We give an example of this in Sect. 7.
5 Concurrent data structures
This section describes the concurrent data structures required to parallelize decision diagram operations. Every operation requires a scalable concurrent unique table for the BDD nodes and a scalable concurrent operation cache. We use a single unique table for all BDD nodes and a single operation cache for all operations.
The parallel efficiency of a taskbased parallelized algorithm depends largely on the contents of each task. For example, tasks that perform many processor calculations and few memory operations typically result in good speedups. Also, tasks that have many subtasks provide load balancing frameworks with ample opportunity to execute independent tasks in parallel. If the number of subtasks is small and the subtasks are relatively shallow, i.e., the “task tree” has a low depth, then parallelization is more difficult.
BDD operations typically perform few calculations and are memory-intensive, since they consist mainly of calls to the operation cache and the unique table. Furthermore, BDD operations typically spawn only one or two independent subtasks for parallel execution, depending on the inputs and the operation. Hence the design of scalable concurrent data structures (for the cache and the unique table) is crucial for the parallel performance of a BDD implementation.
5.1 Representation of nodes
This subsection discusses how BDD nodes, LDD nodes and MTBDD nodes are represented in memory. We use 16 bytes for all types of nodes, so we can use the same unique table for all nodes and have a fixed node size. As we see below, not all bits are needed; unused bits are set to 0. Also, with 16 bytes per node, this means that 4 nodes fit exactly in a cacheline of 64 bytes (the size of the cacheline for many current computer architectures, in particular the x86 family that we use), which is very important for performance. If the unique table is properly aligned in memory, then only one cacheline needs to be accessed when accessing a node.
We use 40 bits to store the index of a node in the unique table. This is sufficient to store up to \(2^{40}\) nodes, i.e. 16 terabytes of nodes, excluding overhead in the hash table (to store all the hashes) and other datastructures. As we see below, there is sufficient space in the nodes to increase this to 48 bits per node (up to 4096 terabytes), although that would have implications for the performance (more difficult bit operations) and for the design of the operation cache.
Edges to nodes Sylvan defines the type BDD as a 64-bit integer, representing an edge to a BDD node. The lowest 40 bits represent the location of the BDD node in the nodes table, and the most significant bit stores the complement mark [10]. The BDD 0 is reserved for the leaf false, with the complemented edge to 0 (i.e. 0x8000000000000000) meaning true. We use the same method for MTBDDs and LDDs, although most MTBDDs do not have complemented edges. LDDs do not have complemented edges at all. The LDD leaf false is represented as 0, and the LDD leaf true is represented as 1. For the MTBDD leaf \(\bot \) we use the leaf 0 that represents Boolean false as well. This has the advantage that Boolean MTBDDs can act as filters for MTBDDs with the MTBDD operation times. The disadvantage is that partial Boolean MTBDDs are not supported by default, but can easily be implemented using a custom MTBDD leaf.
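The edge encoding above (40-bit index, complement mark in the top bit) suggests the following small helpers. The names are illustrative, not Sylvan's API.

```c
#include <stdint.h>

#define INDEX_MASK 0x000000FFFFFFFFFFULL  /* lowest 40 bits: node index */
#define COMPLEMENT 0x8000000000000000ULL  /* most significant bit */

#define BDD_FALSE 0ULL
#define BDD_TRUE  COMPLEMENT  /* complemented edge to node 0 */

static inline uint64_t edge_index(uint64_t e)       { return e & INDEX_MASK; }
static inline int      edge_complemented(uint64_t e){ return (e & COMPLEMENT) != 0; }
/* negation of a BDD with complement edges is a single bit flip */
static inline uint64_t edge_not(uint64_t e)         { return e ^ COMPLEMENT; }
```

This encoding makes negation a constant-time operation, one of the main benefits of complement edges [10].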
Internal BDD nodes Internal BDD nodes store the variable label (24 bits), the low edge (40 bits), the high edge (40 bits), and the complement bit of the high edge (the first bit below).
MTBDD leaves For MTBDDs we use a bit that indicates whether a node is a leaf or not. MTBDD leaves store the leaf type (32 bits), the leaf contents (64 bits) and the fact that they are a leaf (1 bit, set to 1):
Internal MTBDD nodes Internal MTBDD nodes store the variable label (24 bits), the low edge (40 bits), the high edge (40 bits), the complement bit of the high edge (1 bit, the first bit below) and the fact they are not a leaf (1 bit, the second bit below, set to 0).
Internal BDD nodes are identical to internal MTBDD nodes, as unused bits are set to 0. Hence, the BDD 0 can be used as a terminal for Boolean MTBDDs, and the resulting Boolean MTBDD is identical to a BDD of the same function.
Internal LDD nodes Internal LDD nodes store the value (32 bits), the down edge (40 bits) and the right edge (40 bits):
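The 16-byte node layouts above can be sketched as two 64-bit words with pack/unpack helpers. The exact bit positions chosen here are illustrative assumptions (Sylvan's actual layout may place the flag bits differently); the point is that variable, both 40-bit edges, and the flag bits fit in 16 bytes, so four nodes fill one 64-byte cacheline.

```c
#include <stdint.h>

/* One possible 16-byte packing of an internal (MT)BDD node:
 *   word a: variable label (24 bits, bits 40-63) | high edge (40 bits)
 *   word b: leaf bit (bit 63, 0 for internal) | complement bit of the
 *           high edge (bit 62) | unused (0) | low edge (40 bits)
 * Bit positions are illustrative, not necessarily Sylvan's. */
typedef struct { uint64_t a, b; } ddnode;

#define M40 0x000000FFFFFFFFFFULL

static ddnode make_node(uint32_t var, uint64_t low, uint64_t high, int comp)
{
    ddnode n;
    n.a = ((uint64_t)(var & 0xFFFFFF) << 40) | (high & M40);
    n.b = ((uint64_t)comp << 62) | (low & M40);
    return n;
}

static uint32_t node_var(ddnode n)  { return (uint32_t)((n.a >> 40) & 0xFFFFFF); }
static uint64_t node_high(ddnode n) { return n.a & M40; }
static uint64_t node_low(ddnode n)  { return n.b & M40; }
static int      node_comp(ddnode n) { return (int)((n.b >> 62) & 1); }
```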
5.2 Scalable unique table
This subsection describes the hash tables that we use to store the unique decision diagram nodes. We refer to [21] for a more extensive treatment of these hash tables.
The hash table distinguishes two modes of operation:

 1. During normal operation, threads only call the method findorinsert, which takes as input the node and either returns a unique identifier for the data, or raises the TableFull signal if the algorithm fails to insert the data.
 2. During garbage collection, findorinsert is never called.

The hash table has the following features:

- It uses a probe sequence called “walking-the-line” that is efficient with respect to transferred cachelines. See also Fig. 6.
- It uses a lightweight parametrised local “writing lock” when inserting data, which almost always only delays threads that insert the same data.
- It separates the stored data in a “data array” from the hash of the data in the “hash array”, so directly comparing the stored data is often avoided. See also Fig. 7.
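The "walking-the-line" probe sequence can be sketched as follows: first visit all buckets in the cacheline of the initial position, then rehash to jump to a fresh cacheline, so that each group of probes costs at most one cacheline transfer. The parameters and the rehash function are illustrative assumptions.

```c
#include <stdint.h>

#define BUCKETS_PER_LINE 8  /* 64-byte cacheline / 8-byte hash buckets */

/* Given the current bucket and the number of probes done so far, return
 * the next bucket to try (sketch of walking-the-line probing). */
uint64_t next_probe(uint64_t pos, uint64_t table_size, int *step)
{
    uint64_t line = pos / BUCKETS_PER_LINE;
    if (++(*step) % BUCKETS_PER_LINE != 0) {
        /* walk within the current cacheline, wrapping around its end */
        return line * BUCKETS_PER_LINE + (pos + 1) % BUCKETS_PER_LINE;
    }
    /* cacheline exhausted: rehash to jump to a different cacheline */
    uint64_t h = (pos + 1) * 0x9E3779B97F4A7C15ULL;
    return (h % (table_size / BUCKETS_PER_LINE)) * BUCKETS_PER_LINE;
}
```

Compared with plain linear probing, this keeps the low collision behavior of rehashing while paying for at most one cacheline transfer per eight probes.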
Writing lock When multiple workers simultaneously access the hash table to find or insert data, there must be some mechanism to avoid race conditions, such as inserting the same data twice, or trying to insert different data at the same location simultaneously. Rather than using a global lock on the entire hash table or regions of the hash table, or a nonspecific local lock on each bucket, the hash table of [41] combines a shortlived local lock with a hash value of the data that is inserted. This way, threads that are finding or inserting data with a different hash value know that they can skip the locked bucket in their search.
To insert data, a worker first locks an empty bucket using an atomic cas operation that sets the lock together with the hash value of the inserted data, then writes the data, and finally releases the lock. Only workers that are finding or inserting data with the same hash as the locked bucket need to wait until the lock is released. This approach is not lock-free. The authors state that a mechanism could be implemented that ensures local progress (making the algorithm wait-free); however, this is not needed, since the writing locks are rarely hit under normal operation [41].
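The short-lived writing lock can be sketched with C11 atomics as below. The bucket layout (lock bit plus hash in one 64-bit word) and the function names are illustrative; the point is that the cas both claims the bucket and publishes the hash, so searchers with a different hash can skip it without waiting.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOCK  0x8000000000000000ULL
#define EMPTY 0ULL

/* Try to claim an empty bucket for data with the given hash.
 * Returns 1 if we now hold the writing lock on this bucket. */
int try_lock_bucket(_Atomic uint64_t *bucket, uint64_t hash)
{
    uint64_t expected = EMPTY;
    return atomic_compare_exchange_strong(bucket, &expected, LOCK | hash);
}

/* After writing the data, release the lock, leaving the hash in place. */
void unlock_bucket(_Atomic uint64_t *bucket, uint64_t hash)
{
    atomic_store(bucket, hash);
}
```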
The hash table presented in [26] stores independent locations for the bucket in the hash array and in the data array. The idea is that the location of the decision diagram node in the data array is used for the node identifier and that nodes can be reinserted into the hash array without changing the node identifier. This is important, since garbage collection is performed often and node identifiers should remain unchanged during garbage collection, i.e., nodes should not be moved. To implement this feature, the buckets from the hash array are extended to contain the index in the data array where the corresponding data is stored, as well as a bit that controls whether the bucket in the data array with the same index is in use (see Fig. 8). See further [26].
The hash table in [26] has the drawback that the speculative insertion and uninsertion into the data array requires atomic cas operations, once for the insertion and once for the uninsertion. Instead of using a field D in the hash array, we use a separate bit array databits to implement a parallel allocator for the data array. Furthermore, to avoid having to use cas for every change to databits, we divide this bit array into regions, such that every region matches exactly with one cacheline of the databits array, i.e., 512 buckets per region if there are 64 bytes in a cacheline, which is the case for most current architectures. Every worker has exclusive access to one region, which is managed with a second bit array regionbits. Only changes to regionbits (to claim a new region) require an atomic cas. The new version therefore only uses normal writes for insertion and uninsertion into the data array, and only occasionally an atomic cas during speculative insertion to obtain exclusive access to the next region of 512 buckets.
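The region-based allocator can be sketched as follows, with much smaller (illustrative) sizes and byte-per-bucket flags instead of packed bits. The names `claim_next_region` and `reserve_data_bucket` mirror the operations described in the text but are assumptions, not Sylvan's exact API. The key property shown: only claiming a region needs a cas; marking individual buckets within a claimed region uses plain writes, since the owning worker has exclusive access.

```c
#include <stdatomic.h>
#include <stdint.h>

#define REGIONS 16
#define BUCKETS_PER_REGION 512

static _Atomic uint8_t regionbits[REGIONS];           /* 1 = region claimed */
static uint8_t databits[REGIONS][BUCKETS_PER_REGION]; /* 1 = bucket in use */

/* Claim the next free region with one cas; returns its index, or -1 if
 * all regions are taken (table full). */
int claim_next_region(void)
{
    for (int r = 0; r < REGIONS; r++) {
        uint8_t expected = 0;
        if (atomic_compare_exchange_strong(&regionbits[r], &expected, 1))
            return r;
    }
    return -1;
}

/* Reserve a free bucket in our own region; no cas needed because the
 * region is exclusively ours. Returns the global bucket index, or -1 if
 * the region is exhausted (it may contain buckets still in use after an
 * earlier garbage collection). */
int reserve_data_bucket(int region)
{
    for (int i = 0; i < BUCKETS_PER_REGION; i++) {
        if (databits[region][i] == 0) {
            databits[region][i] = 1;  /* plain write suffices */
            return region * BUCKETS_PER_REGION + i;
        }
    }
    return -1;
}
```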
The layout of the hash array and the data array is given in Fig. 9. We also remove the field H, which is obsolete as we use a hash function that never hashes to 0 and we forbid nodes with the index 0, because 0 is a reserved value in Sylvan. The fields hash and index are therefore never 0, unless the hash bucket is empty, so the field H to indicate that hash and index have valid values is not necessary. Manipulating the hash array bucket is also simpler, since we no longer need to take into account changes to the field D.
Inserting data into the hash table consists of three steps. First, the algorithm determines whether the data is already in the table. If this is not the case, then a new bucket in the data array is reserved in the current region of the thread with reserve-data-bucket. If the current region is full, then the thread claims a new region with claim-next-region. It is possible that the next region contains used buckets, if there has been a garbage collection earlier, or even that it is already full for this reason. When the data has been inserted into an available bucket in the data array, the hash and index of the data are also inserted into the hash array. Sometimes, the data has been inserted concurrently (by another thread); in that case the bucket in the data array is freed again with the free-data-bucket function, so it is available the next time the thread wants to insert data.
The main method of the hash table is find-or-insert; see Algorithm 3. The algorithm uses the local variable "index" to keep track of whether the data has been inserted into the data array. This variable is initialized to 0 (line 2), which signifies that the data is not yet in the data array. For every bucket in the probe sequence, we first check whether the bucket is empty (line 6). In that case, the data is not yet in the table. If we did not yet write the data to the data array, then we reserve the next data bucket and write the data (lines 7–9). We use an atomic cas to insert the hash and index into the hash array (line 10). If this succeeds, then the algorithm is done and returns the location of the data in the data array. If the cas operation fails, some other thread inserted data here; we refresh our knowledge of the bucket (line 11) and continue at line 12. If the bucket is not or no longer empty, then we compare the stored hash with the hash of our data, and if this matches, we compare the data in the data array with the given input (line 12). If this matches, then we may need to free the reserved bucket (line 13) and we return the index of the data in the data array (line 14). If we finish the probe sequence without inserting the data, we raise the TableFull signal (line 15).
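As a rough illustration of this control flow, here is a much-simplified, sequential sketch of find-or-insert over the two arrays. All names, sizes, and the toy hash function are ours; the real algorithm uses an atomic cas at the claim step and region-based data-bucket reservation.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sizes; TABLE_SIZE must be a power of two for the mask. */
#define TABLE_SIZE 1024
#define PROBE_LEN  8   /* bounded probe sequence; exhausted => "GC needed" */

typedef struct {
    uint32_t hash;   /* 0 means the bucket is empty */
    uint32_t index;  /* location of the node in the data array */
} bucket_t;

static bucket_t hash_array[TABLE_SIZE];
static uint64_t data_array[TABLE_SIZE];
static uint32_t next_free = 1;  /* index 0 is reserved */

/* Toy hash that never returns 0, mirroring the trick that lets the
   table drop the validity bit H. */
static uint32_t node_hash(uint64_t data) {
    uint64_t h = data * 0x9e3779b97f4a7c15ull;
    uint32_t r = (uint32_t)(h ^ (h >> 32));
    return r == 0 ? 1 : r;
}

/* Returns the data-array index of the node, or 0 if the probe sequence
   is exhausted (in Sylvan this raises TableFull, triggering GC). */
uint32_t find_or_insert(uint64_t data) {
    uint32_t h = node_hash(data);
    uint32_t index = 0;  /* 0: not yet written to the data array */
    for (int i = 0; i < PROBE_LEN; i++) {
        bucket_t *b = &hash_array[(h + i) & (TABLE_SIZE - 1)];
        if (b->hash == 0) {          /* empty bucket: claim it */
            if (index == 0) {        /* write the data first */
                index = next_free++;
                data_array[index] = data;
            }
            b->hash = h;             /* a cas in the concurrent version */
            b->index = index;
            return index;
        }
        /* Non-empty: compare hashes, then the stored data itself.
           (The real algorithm frees a speculatively reserved data
           bucket here if another thread inserted the node first.) */
        if (b->hash == h && data_array[b->index] == data)
            return b->index;
    }
    return 0;  /* probe sequence full */
}
```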
The find-or-insert method relies on the methods reserve-data-bucket and free-data-bucket, which are also given in Algorithm 3. They are straightforward.
The claim-next-region method searches for the first 0-bit in the regionbits array. The value tablesize here represents the size of the entire table. We use a simple linear search and a cas loop to actually claim the region. Note that we may be competing with threads that are trying to set the bit of a different region, since the smallest range for an atomic cas operation is 1 byte, or 8 bits.
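A sketch of such a claim loop, assuming C11 atomics and one bit per region; the region count and names are illustrative, not Sylvan's. The cas loop only retries while our own bit is still 0, i.e., while we are merely losing races against threads claiming other regions in the same byte.

```c
#include <stdint.h>
#include <stdatomic.h>
#include <assert.h>

#define NUM_REGIONS 64  /* illustrative; each region guards 512 buckets */

static _Atomic uint8_t region_bits[NUM_REGIONS / 8];

/* Returns the claimed region number, or -1 when all regions are taken
   (in Sylvan this raises the TableFull signal). */
int claim_next_region(void) {
    for (int r = 0; r < NUM_REGIONS; r++) {
        uint8_t mask = (uint8_t)(1u << (r & 7));
        _Atomic uint8_t *byte = &region_bits[r >> 3];
        uint8_t old = atomic_load(byte);
        /* cas loop: a failure means another worker changed this byte,
           possibly for a *different* region, so retry while our bit
           is still free */
        while ((old & mask) == 0) {
            if (atomic_compare_exchange_weak(byte, &old, old | mask))
                return r;          /* claimed region r */
        }
        /* bit already set: region taken, try the next one */
    }
    return -1;
}
```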
The algorithms in Algorithm 3 are wait-free. The method claim-next-region is wait-free, since the number of cas failures is bounded: regions are only claimed and not released (until garbage collection), and the number of regions is bounded, so the maximum number of cas failures is the number of regions. The free-data-bucket method is trivially wait-free: there are no loops. The reserve-data-bucket method contains a loop, but since claim-next-region is wait-free and the number of times claim-next-region returns a value instead of raising the TableFull signal is bounded by the number of regions, reserve-data-bucket is also wait-free. Finally, the find-or-insert method only relies on wait-free methods and has only one for loop (line 4), which is bounded by the number of items in the probe sequence. It is therefore also wait-free.
5.3 Scalable operation cache
The operation cache is a hash table that stores intermediate results of BDD operations. It is well known that an operation cache is required to reduce the worstcase time complexity of BDD operations from exponential time to polynomial time.
In practice, we do not guarantee this property. Since Sylvan is a parallel package, it is possible that multiple workers compute the same operation simultaneously. While operations could use the operation cache to "claim" a computation (using a dummy result and promising a real result later), we found that the amount of duplicate work due to parallelism is limited. In addition, to guarantee polynomial time, the operation cache must store every subresult. In practice, we obtain better performance by caching many but not all results, and by allowing the cache to overwrite earlier results on a hash collision.
In [55], Somenzi writes that a lossless computed table guarantees polynomial cost for the basic synthesis operations, but that lossless tables (which do not throw away results) are not feasible when manipulating many large BDDs, so in practice lossy computed tables (which may throw away results) are implemented. If the cost of recomputing subresults is sufficiently small, it can pay off to regularly delete results, or even to skip the cache occasionally to avoid data races. We therefore design our operation cache to abort operations as quickly as possible when there may be a data race or the data may already be in the cache.
On top of this, our BDD implementation implements caching granularity, which controls when results are cached. Most BDD operations compute a result on a variable \(x_i\), which is the top variable of the inputs. For granularity G, a variable \(x_i\) is in the cache block \(\lfloor i/G \rfloor \). Each BDD suboperation then uses the cache only once per cache block, by comparing the cache block of the parent operation with that of the current operation.
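As a small illustration (the block formula and the granularity value below are our reading of the scheme, not Sylvan's exact code): with granularity G, a suboperation consults the cache only when it has crossed into a new block of G consecutive variables.

```c
#include <assert.h>

#define G 4  /* illustrative granularity */

/* With granularity G, variable x_i belongs to cache block i/G. */
int cache_block(int var) { return var / G; }

/* Should this recursive call consult the operation cache?
   Only when its block differs from the parent operation's block. */
int use_cache(int parent_var, int current_var) {
    return cache_block(parent_var) != cache_block(current_var);
}
```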
We use an operation cache which, like the hash tables described above, consists of two arrays: the hash array and the data array. See Fig. 10 for the layout. Since we implement a lossy cache, the design of the operation cache is extremely simple. We do not implement a special strategy to deal with hash collisions, but simply overwrite old results. There is a trade-off between the cost of recomputing operations and the cost of synchronizing with the cache. For example, the caching granularity increases the number of recomputed operations but improves performance in practice.
The most important concern for correctness is that every result obtained via cache-get was inserted earlier with cache-put, and the most important concern for performance is that the number of memory accesses is as low as possible. To ensure both, we use a "version tag" that is incremented (modulo 4096) with every update to the bucket, and we check this value before and after reading the cache to determine whether the obtained result is valid. The chance of obtaining an incorrect result is astronomically small, as this requires precisely 4096 cache-put operations on the same bucket by other workers between the first and the second time the tag is read in cache-get, and the last of these 4096 operations must have exactly the same hash value. Using a "version tag" like this is a well-known technique that goes back to as early as 1975 [36, p. 125].
See Algorithm 4 for the cache-put algorithm and Algorithm 5 for the cache-get algorithm. The algorithms are quite straightforward. We use a 64-bit hash function that returns sufficient bits for the 15-bit h value and the location value. The h value is used as the hash in the hash array, and the location as the location of the bucket in the table. The cache-put operation aborts as soon as a problem arises, i.e., if the bucket is locked (line 4), if the hash of the stored key matches the hash of the given key (line 5), or if the cas operation fails (line 6). If the cas operation succeeds, then the bucket is locked. The key-value pair is written to the cache array (line 7) and the bucket is unlocked (line 8, by setting the locked bit to 0).
In the cache-get operation, when the bucket is locked (line 4), we abort instead of waiting for the result. We also abort if the hashes differ (line 5). Otherwise, we read the result (line 6) and compare the stored key to the requested key (line 7). If the keys are identical, we verify that the cache bucket has not been modified by a concurrent operation by comparing the "tag" counter (line 8).
As discussed above, it is possible that between lines 6–8 of the cache-get operation, exactly 4096 cache-put operations are performed on the same bucket by other workers, where the last one has exactly the same hash. The chances of this occurring are astronomically small. The reason we chose this design is that this implementation of cache-get only reads from memory and never writes. Memory writes cause additional communication between processors and with main memory, and also force other processor caches to invalidate their copy of the cache line. We also want to avoid locking buckets for reading, because locking often causes bottlenecks. Since there are no loops in either algorithm, both algorithms are wait-free.
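The tag-validated read can be sketched as follows. This is a sequential illustration with invented field widths and names; it omits the cas-based locking of cache-put and simply shows the overwrite-on-collision policy, the abort-on-matching-hash shortcut, and the before/after tag check in cache-get.

```c
#include <stdint.h>
#include <assert.h>

#define CACHE_SIZE 256  /* illustrative; must be a power of two */

typedef struct {
    uint16_t tag;    /* bumped (mod 4096) on every update */
    uint16_t hash;   /* short hash of the stored key */
    uint64_t key;
    uint64_t value;
} cache_bucket_t;

static cache_bucket_t cache[CACHE_SIZE];

static uint64_t mix(uint64_t k) { return k * 0x9e3779b97f4a7c15ull; }

void cache_put(uint64_t key, uint64_t value) {
    uint64_t h = mix(key);
    cache_bucket_t *b = &cache[h & (CACHE_SIZE - 1)];
    if (b->hash == (uint16_t)(h >> 48)) return;  /* same hash: abort */
    /* (in the real code a cas on the tag word locks the bucket here) */
    b->tag = (uint16_t)((b->tag + 1) & 0xFFF);   /* version tag, mod 4096 */
    b->hash = (uint16_t)(h >> 48);
    b->key = key;
    b->value = value;
}

/* Returns 1 and fills *value on a hit, 0 on a miss or a detected race. */
int cache_get(uint64_t key, uint64_t *value) {
    uint64_t h = mix(key);
    cache_bucket_t *b = &cache[h & (CACHE_SIZE - 1)];
    uint16_t tag_before = b->tag;                /* read tag first */
    if (b->hash != (uint16_t)(h >> 48)) return 0;
    if (b->key != key) return 0;
    uint64_t v = b->value;
    if (b->tag != tag_before) return 0;          /* concurrent update: abort */
    *value = v;
    return 1;
}
```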
5.4 Garbage collection
Operations on decision diagrams typically create many new nodes and discard old nodes. Nodes that are no longer referenced are called “dead nodes”. Garbage collection, which removes dead nodes from the unique table, is essential for the implementation of decision diagrams. Since dead nodes are often reused in later operations, garbage collection should be delayed as long as possible [55].
There are various approaches to garbage collection. For example, a reference count could be added to each node, which records how often a node is referenced. Nodes with a reference count of zero are either immediately removed when the count decreases to zero, or during a separate garbage collection phase. Another approach is markandsweep, which marks all nodes that must be kept and removes all unmarked nodes. We refer to [55] for a more indepth discussion of garbage collection.
For a parallel implementation, reference counts can incur a significant cost: accessing nodes implies continuously updating their reference counts, and every write to a location in memory forces all other processors to refresh their view of that location. This is not a severe issue with only one processor, but with many processors it results in excessive communication, especially for frequently used nodes.
When parallelizing decision diagram operations, we can either perform garbage collection "on-the-fly", allowing other workers to continue inserting nodes, or "stop-the-world" and have all workers cooperate on garbage collection. We use a separate garbage collection phase, during which no new nodes are inserted. This greatly simplifies the design of the hash table, and we see no major advantage in allowing some workers to continue inserting nodes during garbage collection.
Some decision diagram implementations use a global variable that counts how many buckets in the nodes table are in use and triggers garbage collection when a certain percentage of the table is in use. We want to avoid such global counters and instead use a bounded probe sequence for the nodes table: when the algorithm cannot find an empty bucket in the first K buckets, garbage collection is triggered. In simulations and experiments, we find that this occurs when the hash table is between 80 and 95% full.
As described above, decision diagram nodes are stored in a “data array”, separated from the metadata of the unique table, which is stored in the “hash array”. Nodes can be removed from the hash table without deleting them from the data array, simply by clearing the hash array. The nodes can then be reinserted during garbage collection, without changing their location in the data array, thus preserving the identity of the nodes.
Garbage collection in Sylvan consists of the following steps:
1. Initiate the operation using the Lace framework to arrange the "stop-the-world" interruption of all ongoing tasks.
2. Clear the hash array of the unique table, and clear the operation cache. The operation cache is cleared instead of checking each entry individually after garbage collection, although that is also possible.
3. Mark all nodes that we want to keep, using various mechanisms that keep track of the decision diagram nodes that we want to keep (see below).
4. Count the number of kept nodes and optionally increase the size of the unique table. Also optionally change the size of the operation cache.
5. Rehash the marked nodes in the hash array of the unique table.
Sylvan offers several mechanisms to mark the nodes that must be kept during garbage collection:
- The sylvan_protect and sylvan_unprotect methods maintain a set of pointers. During garbage collection, each pointer is inspected and the referenced BDD is marked. This method is preferred for long-lived external references.
- Each thread has a thread-local BDD stack, operated using the methods bdd_refs_push and bdd_refs_pop. This method is preferred for storing intermediate results in BDD operations.
- Each thread has a thread-local Task stack, operated using the methods bdd_refs_spawn and bdd_refs_sync. Tasks that return BDDs are stored in this stack, and during garbage collection the results of finished tasks are marked. This method is required when using SPAWN and SYNC on a task that returns a BDD.
- The sylvan_ref and sylvan_deref methods maintain a set of BDDs to be marked during garbage collection. This is a standard method offered by many BDD implementations, but we recommend using sylvan_protect and sylvan_unprotect instead.
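The pointer-based mechanism (in the style of sylvan_protect) can be sketched in a few lines. This is a simplified sequential illustration with our own names (protect, unprotect, mark_protected), not Sylvan's implementation; it shows why registering the address of a variable, rather than its value, keeps the set valid when the variable is later reassigned.

```c
#include <stddef.h>
#include <assert.h>

typedef unsigned long BDD;  /* stand-in for a node identifier */
#define MAX_PROTECTED 64

static BDD *protected_ptrs[MAX_PROTECTED];
static size_t protected_count = 0;

/* register the ADDRESS of an external BDD variable */
void protect(BDD *ptr) { protected_ptrs[protected_count++] = ptr; }

void unprotect(BDD *ptr) {
    for (size_t i = 0; i < protected_count; i++)
        if (protected_ptrs[i] == ptr) {
            /* swap in the last entry; order does not matter */
            protected_ptrs[i] = protected_ptrs[--protected_count];
            return;
        }
}

/* The collector dereferences each registered pointer and marks the
   BDD it currently holds; returns the number of marked references. */
size_t mark_protected(void (*mark)(BDD)) {
    for (size_t i = 0; i < protected_count; i++)
        mark(*protected_ptrs[i]);
    return protected_count;
}

/* tiny helper used by the usage example below */
static size_t mark_calls;
static void count_mark(BDD b) { (void)b; mark_calls++; }
```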
The garbage collection process itself is also executed in parallel. Removing all nodes from the hash table and clearing the operation cache is an instant operation; its cost is amortized over time by the operating system, which reassigns and zeroes the memory pages on demand (see below). Marking nodes that must be kept occurs in parallel, mainly by implementing the marking operation as a recursive task using Lace. Counting the number of used nodes and rehashing all nodes (steps 4–5) is also parallelized using a standard binary divide-and-conquer approach, which distributes the memory pages over all workers.
5.5 Memory management
Memory in modern computers is divided into regions called pages, typically (but not always) 4096 bytes in size. Furthermore, computers distinguish between "virtual" and "real" memory: it is possible to allocate much more virtual memory than is actually used. The operating system is responsible for assigning real pages to virtual pages and for clearing memory pages (to zero) when they are first used.
We use this feature to implement resizing of our unique table and operation cache. We preallocate memory according to a maximum number of buckets. Via the global variables table_size and max_size we control which part of the allocated memory is actually used. When the table is resized, we simply change the value of table_size. To free pages, the kernel can be advised to release real pages using a madvise call (on Linux), but Sylvan only implements increasing the size of the tables, not decreasing it.
Furthermore, when performing garbage collection, we clear the operation cache and the hash array of the unique table by reallocating the memory. Then, the actual clearing of the used pages only occurs on demand by the operating system, when new information is written to the tables.
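On Linux, this virtual-memory trick can be sketched as follows. This is a hedged illustration, not Sylvan's code: it assumes a POSIX system, uses invented names, and keeps error handling minimal.

```c
#define _DEFAULT_SOURCE  /* for MAP_ANONYMOUS and madvise on glibc */
#include <sys/mman.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

static size_t max_size   = (size_t)1 << 24;  /* reserved: 16 MB virtual */
static size_t table_size = (size_t)1 << 16;  /* prefix actually in use */

uint8_t *table_alloc(void) {
    /* Real pages are only assigned on first write, so reserving far
       more virtual memory than table_size costs (almost) nothing. */
    void *p = mmap(NULL, max_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (uint8_t *)p;
}

/* Resizing is just bumping table_size, up to the reserved maximum. */
int table_resize(size_t new_size) {
    if (new_size > max_size) return 0;
    table_size = new_size;
    return 1;
}

/* "Clearing" during garbage collection: advise the kernel that the
   pages can be dropped; they are re-zeroed lazily on the next write. */
void table_clear(uint8_t *table) {
    madvise(table, table_size, MADV_DONTNEED);
}
```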
6 Algorithms on decision diagrams
Table 1 Basic BDD operations on the input BDDs x, y, z
Operation  Implementation 

\(x \wedge y\)  \(\texttt {and}(x, y)\) 
\(x \vee y\)  \(\texttt {not}(\texttt {and}(\texttt {not}(x), \texttt {not}(y)))\) 
\(\lnot (x \wedge y)\)  \(\texttt {not}(\texttt {and}(x, y))\) 
\(\lnot (x \vee y)\)  \(\texttt {and}(\texttt {not}(x), \texttt {not}(y))\) 
\(x \oplus y\)  \(\texttt {xor}(x, y)\) 
\(x \leftrightarrow y\)  \(\texttt {not}(\texttt {xor}(x, y))\) 
\(x \rightarrow y\)  \(\texttt {not}(\texttt {and}(x, \texttt {not}(y)))\) 
\(x \leftarrow y\)  \(\texttt {not}(\texttt {and}(\texttt {not}(x), y))\) 
\(\text {if } x \text { then } y \text { else } z\)  \(\texttt {ite}(x, y, z)\) 
\(\exists v:x\)  \(\texttt {exists}(x, v)\) 
\(\forall v:x\)  \(\texttt {not}(\texttt {exists}(\texttt {not}(x), v))\) 
6.1 BDD algorithms
Sylvan implements the basic BDD operations (Table 1): and, not, xor, the if-then-else (ite) operation, and exists. Implementing the basic operations in this way is common for BDD packages. Negation \(\lnot \) (not) is performed using complement edges and is essentially free.
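As a sanity check of the reductions in Table 1, the derived connectives can be verified on truth values (a toy check on 0/1 ints; Sylvan of course performs these reductions on BDDs):

```c
#include <assert.h>

/* the primitive operations */
int and_(int x, int y) { return x & y; }
int not_(int x)        { return !x; }
int xor_(int x, int y) { return x ^ y; }

/* derived connectives, exactly as in Table 1 */
int or_(int x, int y)    { return not_(and_(not_(x), not_(y))); }
int imp_(int x, int y)   { return not_(and_(x, not_(y))); }   /* x -> y */
int equiv_(int x, int y) { return not_(xor_(x, y)); }         /* x <-> y */
```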
The parallelization of these functions is straightforward. See Algorithm 6 for the parallel implementation of and. This algorithm checks the trivial cases (lines 2–4) before the operation cache (line 5), and then runs the two independent suboperations (lines 8–9) in parallel.
Another operation that is parallelized similarly is the compose operation, which performs functional composition, i.e., substitutes occurrences of variables in a Boolean formula by Boolean functions. For example, the substitution \([x_1:=x_2\vee x_3, x_2:=x_4\vee x_5]\) applied to the function \(x_1\wedge x_2\) results in the function \((x_2\vee x_3)\wedge (x_4\vee x_5)\). Sylvan offers a functional composition algorithm based on a "BDDMap". This structure is not a BDD itself, but uses BDD nodes to encode a mapping from variables to BDDs. A BDDMap is based on a disjunction of variables, but with the "high" edges going to BDDs instead of the terminal 1. This method also implements substitution of variables, e.g., \([x_1:=x_2,x_2:=x_3]\). See Algorithm 7 for the algorithm compose. This parallel algorithm is similar to the algorithms described above, with the composition functionality at lines 10–11. If the variable is in the mapping M, then we use the if-then-else method to compute the substitution. If the variable is not in the mapping M, then we simply compute the result using lookupBDDnode.
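The example substitution can be checked by exhaustive evaluation (a toy illustration with our own names; Sylvan performs the composition on the BDD structure itself via the BDDMap):

```c
#include <assert.h>

/* v[1..5] hold the values of x1..x5 */
static int f(const int *v)        { return v[1] && v[2]; }       /* x1 & x2 */
static int expected(const int *v) { return (v[2] || v[3]) && (v[4] || v[5]); }

/* f after [x1 := x2|x3, x2 := x4|x5]: substitute, then evaluate */
static int composed(const int *v) {
    int w[6];
    for (int i = 0; i < 6; i++) w[i] = v[i];
    w[1] = v[2] || v[3];
    w[2] = v[4] || v[5];
    return f(w);
}

/* exhaustively compare two 5-variable Boolean functions */
int same_function(int (*g)(const int *), int (*h)(const int *)) {
    for (int m = 0; m < 32; m++) {
        int v[6] = {0, m & 1, (m >> 1) & 1, (m >> 2) & 1,
                    (m >> 3) & 1, (m >> 4) & 1};
        if (g(v) != h(v)) return 0;
    }
    return 1;
}
```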
Sylvan also implements parallelized versions of the BDD minimization algorithms restrict and constrain (also called generalized cofactor), based on sibling-substitution; these are described in [20] and parallelized similarly to the and algorithm above.
Typically, we want the BDD of the successor states defined on the unprimed variables \(\vec x\) instead of the primed variables \(\vec x'\), so the and_exists call is then followed by a variable substitution that replaces all occurrences of variables from \(\vec x'\) by the corresponding variables from \(\vec x\). Furthermore, the variables are typically interleaved in the variable ordering, like \(x_1,x'_1,x_2,x'_2,\ldots ,x_N,x'_N\), as this often results in smaller BDDs. Sylvan implements specialized operations relnext and relprev that compute the successors and the predecessors of sets of states, where the transition relation is encoded with the interleaved variable ordering. See Algorithm 8 for the implementation of relnext. This function takes as input a set S, a transition relation R, and the set of variables V, which is the union of the interleaved sets \(\vec x\) and \(\vec x'\) (the variables on which the transition relation is defined). We first check for terminal cases (lines 2–3); these are the same cases as for the \(\wedge \) operation. Then we process the set of variables V to skip variables that are not in S and R (lines 5–6). After consulting the cache (line 7), either the current variable is in the transition relation, or it is not. If it is not, we perform the usual recursive calls and compute the result (lines 21–24). If the current variable is in the transition relation, then we let x and \(x'\) be the two relevant variables (either of these equals v) and compute four subresults in parallel, namely for the transitions (a) from 0 to 0, (b) from 1 to 0, (c) from 0 to 1, and (d) from 1 to 1 (lines 11–15). We then abstract from \(x'\) by computing the existential quantifications in parallel (lines 16–18), and finally compute the result (line 19). This result is stored in the cache (line 25) and returned (line 26). We implement relprev similarly.
6.2 MTBDD algorithms
Although multiterminal binary decision diagrams are most often used to represent functions to integers or real numbers, they can represent functions to any domain. For example, the well-known BDD package CUDD [56] implements MTBDDs with double (floating-point) leaves. Some applications require other types of leaves, for example to represent rational numbers or integers. To allow different types of MTBDDs, we designed a generic customizable framework: anyone can use the given functionality or extend it with other leaf types or other operations.
Leaf type  Function type 

BDDs false and true  Total functions \(\mathbb {B}^N\rightarrow \mathbb {B}\) (BDDs) 
64-bit integer (uint64)  Partial functions \(\mathbb {B}^N\rightarrow \mathbb {N}\) 
Floating-point (double)  Partial functions \(\mathbb {B}^N\rightarrow \mathbb {R}\) 
Rational leaves  Partial functions \(\mathbb {B}^N\rightarrow \mathbb {Q}\) 
GMP library leaves (mpq)  Partial functions \(\mathbb {B}^N\rightarrow \mathbb {Q}\) 
Sylvan implements a generic binary apply function, a generic monadic apply function, and a generic abstraction algorithm. The implementation of binary apply is similar to Algorithm 1. See Algorithm 9 for the implementation of abstraction. On top of these generic algorithms, we implemented basic operators plus, times, min and max for the default leaf types. For all valuations of MTBDDs x and y that end in leaves a and b, they compute \(a+b\), \(a\times b\), \(\min (a, b)\) and \(\max (a, b)\). For Boolean MTBDDs, the plus and times operators are similar to \(\vee \) and \(\wedge \). When using times with a Boolean MTBDD (or a BDD) and an MTBDD of some other type, it acts as a filter, removing the subgraphs where the BDD is false.

To support custom leaf types, Sylvan requires the following callbacks to be implemented:
- hash(value, seed) computes a 64-bit hash for the leaf value and the given 64-bit seed.
- equals(value1, value2) returns 1 if the two values encode the same leaf, and 0 otherwise. The default implementation simply compares the two values.
- create(pointer) is called when a new leaf is created with the 64-bit value referenced by the pointer; this allows implementations to allocate memory and replace the referenced value with the final value.
- destroy(value) is called when the leaf is garbage collected, so the implementation can free any memory allocated by create.

For the example implementation of GMP rational number (mpq) leaves:
- hash follows the pointer and hashes the contents of the mpq datastructure.
- equals follows the pointers and compares their contents.
- create clones the mpq datastructure and writes the address of the clone to the given new leaf.
- destroy frees the memory of the cloned datastructure.
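A self-contained sketch of these four callbacks, substituting a plain heap-allocated fraction struct for GMP's mpq so the example needs no external library; all names and types here are illustrative, but the pattern (the 64-bit leaf value holds a pointer, create clones, destroy frees) is the one described above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

typedef struct { long num, den; } frac_t;  /* stand-in for mpq */

/* hash follows the pointer and hashes the contents */
uint64_t frac_hash(uint64_t value, uint64_t seed) {
    const frac_t *f = (const frac_t *)(uintptr_t)value;
    uint64_t h = seed ^ (uint64_t)f->num;
    return h * 0x9e3779b97f4a7c15ull ^ (uint64_t)f->den;
}

/* equals follows the pointers and compares the contents */
int frac_equals(uint64_t v1, uint64_t v2) {
    const frac_t *a = (const frac_t *)(uintptr_t)v1;
    const frac_t *b = (const frac_t *)(uintptr_t)v2;
    return a->num == b->num && a->den == b->den;
}

/* create clones the struct and replaces the referenced value with
   the address of the clone (the "final value" stored in the table) */
void frac_create(uint64_t *value) {
    frac_t *clone = malloc(sizeof(frac_t));
    memcpy(clone, (frac_t *)(uintptr_t)*value, sizeof(frac_t));
    *value = (uint64_t)(uintptr_t)clone;
}

/* destroy frees the clone when the leaf is garbage collected */
void frac_destroy(uint64_t value) {
    free((frac_t *)(uintptr_t)value);
}
```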
6.3 LDD algorithms
We implemented various LDD operations that are required for model checking in LTSmin (see Sect. 7), such as the set operations union, intersect, and minus. We also implemented project (existential quantification), enumerate (for enumeration of the elements in a set) and the two relational operations relprod and relprev. These operations are all recursive and hence straightforward to parallelize using the work-stealing framework Lace and the datastructures developed earlier for the BDD operations.
7 Application: parallelism in LTSmin
One major application for which we developed Sylvan is symbolic model checking. This section describes one of the main algorithms in symbolic model checking, symbolic reachability. We describe the implementation of on-the-fly symbolic reachability in the model checking toolset LTSmin, show how we parallelize symbolic reachability using the disjunctive partitioning of transition relations that LTSmin offers, and how we parallelize on-the-fly transition learning using a custom BDD operation.
7.1 On-the-fly symbolic reachability in LTSmin
In model checking, we create models of complex systems to verify that they function according to certain properties. Systems are modeled as a set of possible states of the system and a set of transitions between these states. Many model checking algorithms depend on statespace generation using a reachability algorithm, for example to calculate all states that are reachable from the initial state of the system, or to check if an invariant is always true, and so forth.
The Pins interface The model checking toolset LTSmin provides a language-independent Partitioned Next-State Interface (Pins), which connects various input languages to model checking algorithms [7, 24, 38, 42, 44]. In Pins, the states of a system are represented by vectors of N integer values. Furthermore, transitions are partitioned into K disjunctive "transition groups", i.e., each transition in the system belongs to exactly one of these groups. The transition relation of each transition group usually depends only on a subset of the entire state vector, called the "short vector". This enables the efficient encoding of transitions that only affect some integers of the state vector. Variables in the short vector are further distinguished by the notions of read dependency and write dependency [44]: the variables that are inspected or read to obtain new transitions are in the "read vector" of the transition group, and the variables that can be modified by transitions in the transition group are in the "write vector". An example of a variable that is only in the read vector is a guard; when a variable is only in the write vector, its original value is irrelevant. Computing short vectors from long vectors is called "projection" in LTSmin and is similar to existential quantification.
Learning transitions
The symbolic reachability algorithm with K transition groups and on-the-fly learning is given in Algorithm 10. This algorithm is an extension of the standard breadth-first search (BFS) reachability algorithm with a frontier set. Algorithm 10 iteratively discovers new states until no new states are found (line 4). For every transition group (line 5), the transition relation is updated with new transitions learned from the frontier set (line 6). The updated transition relation relations[k] is then used to symbolically find all successors of the states in the frontier set (line 7), using the relnext operation described in Sect. 6.1. The sets of new states discovered for the transition groups are pairwise merged into the new set frontier (line 8). Successors that were found in earlier iterations are removed (line 9). All new states are then added to the set of discovered states states (line 10). When no new states are discovered, the set of discovered states is returned (line 11).
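An explicit-state analogue of this loop may help to see the structure of Algorithm 10. Sets of states are bitmasks, each of the K transition groups is a small edge list, and the on-the-fly learning step is omitted (in LTSmin the relations grow during the search); all names are ours.

```c
#include <stdint.h>
#include <assert.h>

typedef struct { int src, dst; } edge_t;
typedef struct { const edge_t *edges; int n; } relation_t;

/* successors of `set` under one transition group (cf. relnext) */
static uint32_t image(uint32_t set, relation_t r) {
    uint32_t next = 0;
    for (int i = 0; i < r.n; i++)
        if (set & (1u << r.edges[i].src))
            next |= 1u << r.edges[i].dst;
    return next;
}

/* BFS with a frontier set over K disjunctive transition groups */
uint32_t reachable(uint32_t initial, const relation_t *rels, int K) {
    uint32_t states = initial, frontier = initial;
    while (frontier != 0) {                 /* until no new states */
        uint32_t successors = 0;
        for (int k = 0; k < K; k++)         /* one image per group, */
            successors |= image(frontier, rels[k]);  /* then merged */
        frontier = successors & ~states;    /* drop already-known states */
        states |= frontier;
    }
    return states;
}
```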
7.2 Parallel on-the-fly symbolic reachability
Even with parallel BDD operations, the parallel speedup of model checking in LTSmin is limited, especially for smaller models, where the size of “work units” (between sequential points in the algorithm) is small and when there are few independent tasks. Experiments in [23] demonstrate this limitation. This is expected: if a parallel program consists of many small operations between sequential points, or if small input BDDs result in few independent tasks, then we expect limited parallel scalability.
In addition, we parallelize the learning algorithm, using a special BDD algorithm collect that combines enumeration and union. In the new implementation, the callback for enumeration does not add the learned transitions to the transition relation, which would result in race conditions, but returns the learned transitions as a BDD or LDD. These sets of transitions are then merged by collect. See Algorithm 12. This algorithm uses the “states” BDD and the set of variables “vars” to generate all state vectors “vec”. For every state in the set of states, a callback is called (line 2). The callback nextstate returns a BDD containing the transitions from the given short state. All learned transitions are then merged and returned (line 8).
8 Experimental evaluation
In this section, we evaluate the LDD extension of Sylvan and the application of parallelization to LTSmin. Compared to [26], we add benchmark results with the new unique table, using LDDs and the fully parallel strategy. We use the same machine as in [26] and confirmed that the original benchmarks still yield comparable results.
We fix the size of the unique table at \(2^{30}\) buckets and the size of the operation cache also at \(2^{30}\) buckets (24 GB for the unique table and 36 GB for the operation cache). Using the parameter --order we select either the par-prev variation or the bfs-prev variation. The bfs-prev variation has no parallelism in LTSmin itself, but uses the parallelized LDD operations, including collect. This means that there is parallel learning, but only for one transition group at a time. In the par-prev variation, learning and computing the successors are performed for all transition groups in parallel.
The full experiment data and benchmark scripts can be found online.^{1}
8.1 Comparing par-prev and bfs-prev
Experiment  \(T_1\)  \(T_{48}\)  \(T_{1}/T_{48}\) 

blocks.4 (par)  629.54  16.58  38.0 
blocks.4 (bfs)  630.04  21.69  29.0 
lifts.8 (par)  377.52  12.03  31.4 
lifts.8 (bfs)  377.36  26.11  14.5 
firewire_tree.1 (par)  16.40  0.99  16.5 
firewire_tree.1 (bfs)  16.43  11.35  1.4 
Sum of all par-prev  20,756  1298  16.0 
Sum of all bfs-prev  20,745  3737  5.6 
For an overview of the obtained speedups on the entire benchmark set, see Fig. 12. Here we see that “larger” models (higher \(T_1\)) are associated with a higher parallel speedup. This plot also shows the benefit of adding parallelism on the algorithmic level, as many models in the fully parallel version have higher speedups. One of the largest improvements was obtained with the firewire_tree.1 model, which went from 1.4\(\times \) to 16.5\(\times \). We conclude that the lack of parallelism is a bottleneck, which can be alleviated by exploiting the disjunctive partitioning of the transition relation.
8.2 Comparing the old and the new unique table
Experiment  \(T_1\)  \(T_{48}\)  \(T_{1}/T_{48}\) 

Sum of all benchmarks (old)  20,756  1298  16.0 
Sum of all benchmarks (new)  16,357  907  18.0 
8.3 Comparing BDDs and LDDs
Finally, we compared the performance of our multicore BDD and LDD variants for the par-prev variation of on-the-fly symbolic reachability. Figure 15 shows that the majority of models, especially larger models, are processed up to several orders of magnitude faster using LDDs. The most extreme example is model frogs.3, which has for BDDs \(T_1=989.40\), \(T_{48}=1005.96\) and for LDDs \(T_1=61.01\), \(T_{48}=9.36\). The large difference suggests that LDDs are a more efficient representation for the models of the BEEM database.
8.4 Recent experiments in related work
Backend  Time (s)  Backend  Time (s) 

sylvan7  608  buddy  2156 
cacbdd  1433  jdd  2439 
cuddbdd  1522  beedeedee  2598 
sylvan1  1838  cuddmtbdd  2837 
This result was produced with a version of Sylvan before the extensions that we present in the current paper. As the results show, Sylvan is competitive with other BDD implementations when used sequentially (with 1 worker) and benefits from parallelism (with 7 workers they obtained a speedup of 3\(\times \)).
Recently, we used Sylvan for the implementation of symbolic bisimulation minimization [27]. For this particular application of binary decision diagrams, it is very beneficial to develop custom BDD operations and to use the MTBDD implementation we present in the current paper, especially for continuous-time Markov chain and interactive Markov chain models, for which support for rational numbers in the MTBDD leaves is highly preferred. Compared to the state-of-the-art tool Sigref [59], which relies on a version of CUDD, we obtained a sequential speedup of up to 95\(\times \) and a parallel speedup of up to 17\(\times \) using 48 workers on benchmarks from the literature [27].
9 Conclusion
This paper presented the design and implementation of a multicore framework for decision diagrams, called Sylvan. Sylvan already supported parallel operations for binary decision diagrams (BDDs) and multi-valued decision diagrams (MDDs) in the form of list decision diagrams (LDDs). The most recent extension is the support for multiple terminals, i.e., a new framework for MTBDDs that supports various types of MTBDD leaves and was designed for customization. We discussed several BDD and MTBDD operations and offered an example implementation of custom MTBDD leaves and operations using the GMP library. We also discussed a new hash table design for Sylvan and showed clear improvements over the previous version. The new table supports parallelized garbage collection for decision diagrams and offers extensive support for customizable "marking" mechanisms.
Using Sylvan, one can very easily speed up sequential symbolic algorithms by replacing the BDD operations with calls to their parallel implementations in Sylvan. On top of this, the framework also supports the parallelization of the higher-level algorithm itself, by allowing concurrent calls to BDD operations. This integration is based on our customizable work-stealing task scheduler Lace. We demonstrated this for parallel symbolic model checking with LTSmin. Experimentally, we demonstrated a speedup of up to 38\(\times \) (with 48 cores) for fully parallel on-the-fly symbolic reachability in LTSmin, and an average of 18\(\times \) for all the BEEM benchmark models using the new hash table.
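The shape of this higher-level parallelism can be sketched as follows. The example is illustrative only: the names `relprod` and `reachable` are hypothetical, the "sets" are plain Python sets rather than BDDs, and a thread pool stands in for the Lace work-stealing scheduler. The point is that the successor computations for the independent sub-transition relations are independent tasks that can run concurrently, with their results merged by union.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch (hypothetical names, toy data structures): the
# higher-level reachability loop issues one concurrent task per
# sub-transition relation; in Sylvan/LTSmin these would be parallel BDD
# operations spawned via the Lace work-stealing scheduler.

def relprod(states, rel):
    # toy stand-in for the BDD relational product:
    # rel maps a state to its set of successors
    return frozenset().union(*(rel.get(s, set()) for s in states))

def reachable(initial, relations):
    visited = frozenset(initial)
    with ThreadPoolExecutor() as pool:
        while True:
            # fire all sub-relations concurrently on the current set
            succs = pool.map(lambda r: relprod(visited, r), relations)
            new = frozenset().union(*succs) - visited
            if not new:
                return visited
            visited |= new
```

With BDDs, the unions themselves are again parallel BDD operations, so both levels of parallelism feed the same task scheduler.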
Initially, it was not clear whether a BDD package could profit significantly from parallelization on a multicore shared-memory computer. BDD operations are highly memory-intensive and exhibit irregular memory access patterns, placing them among the hardest workloads for which to achieve practical speedup. We believe that three ingredients enabled us to achieve efficient parallelism. The first is that we adapted the scalable, lockless hash table design that had already proved its value in explicit-state model checking [41]. The second is that we carefully designed the work-stealing task scheduler Lace [25] to handle the very small individual BDD steps. Finally, we followed a pragmatic approach: for instance, we simply give up cache operations in case of race conditions, rather than retrying them. The experiments in this paper show that further improvements in the concurrent hash table and cache design not only led to reduced sequential running times, but even to higher speedups.
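The "give up on races" policy can be made concrete with a small sketch. This is not Sylvan's actual lockless C implementation (which uses atomic compare-and-swap on cache buckets); the Python version below uses a non-blocking lock acquire to express the same idea: a writer that finds a slot contended simply skips the update. A lost cache entry only costs recomputation, never correctness.

```python
import threading

# Sketch of a "lossy" operation cache (hypothetical class, not Sylvan's C
# code): on contention, put() gives up instead of waiting or retrying.
# A missed or overwritten entry merely causes recomputation later.

class LossyOperationCache:
    def __init__(self, size=1024):
        self.size = size
        self.slots = [None] * size                    # (key, value) or None
        self.locks = [threading.Lock() for _ in range(size)]

    def put(self, key, value):
        i = hash(key) % self.size
        if self.locks[i].acquire(blocking=False):     # contended? give up
            try:
                self.slots[i] = (key, value)          # may overwrite: fine
            finally:
                self.locks[i].release()

    def get(self, key):
        i = hash(key) % self.size
        entry = self.slots[i]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None                                   # treated as a miss

cache = LossyOperationCache()
cache.put(("and", 7, 9), 42)
```

Because the cache is only an accelerator, dropping a write under contention trades a little recomputation for the complete absence of waiting and retry loops on the hot path.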
Our measurements in recent papers and in the current paper show that sequential symbolic algorithms benefit from the "automatic" parallelization provided by the parallel decision diagram operations in Sylvan. We also demonstrated that adding parallelism to higher-level algorithms can result in even higher speedups. In general, multicore decision diagrams can speed up symbolic model checking considerably.
Sylvan is available online^{2} and is released under the Apache 2.0 License, so anyone can freely use and extend it. It comes with an example of a simple BDD-based reachability algorithm, which demonstrates how to use Sylvan to "automatically" parallelize sequential algorithms. A more elaborate example of how applications can add custom BDD operations can be found in the SigrefMC^{3} tool.
An interesting extension of our work would be the addition of dynamic variable reordering. This could either be achieved by a parallel implementation of the sifting algorithm [52], or by a parallel investigation of different variable orderings. Another interesting research question would be to investigate whether the experimental speedups can be maintained for smarter exploration strategies. These strategies determine in which order the symbolic subtransitions are fired. We have investigated the BFS and chaining strategies, since they are external to the decision diagram implementation. It is still open whether the full saturation strategy [16] also profits from parallelization, but this requires a tight integration of the exploration strategy into the multicore decision diagram operations.
Footnotes
 1.
The benchmarks from [26] are available at https://github.com/utwente-fmt/sylvan-tacas2015; the benchmarks presented here are available at https://github.com/utwente-fmt/sylvan-sttt.
 2.
 3.
References
1. Acar, U.A., Charguéraud, A., Rainey, M.: Scheduling parallel programs by work stealing with private deques. In: PPOPP, pp. 219–228. ACM (2013)
2. Akers, S.: Binary decision diagrams. IEEE Trans. Comput. C-27(6), 509–516 (1978)
3. Arunachalam, P., Chase, C.M., Moundanos, D.: Distributed binary decision diagrams for verification of large circuits. In: ICCD, pp. 365–370 (1996)
4. Bahar, R.I., Frohm, E.A., Gaona, C.M., Hachtel, G.D., Macii, E., Pardo, A., Somenzi, F.: Algebraic decision diagrams and their applications. In: ICCAD 1993, pp. 188–191 (1993)
5. Bianchi, F., Corno, F., Rebaudengo, M., Reorda, M.S., Ansaloni, R.: Boolean function manipulation on a parallel system using BDDs. In: HPCN Europe, pp. 916–928 (1997)
6. Blom, S., van de Pol, J.: Symbolic reachability for process algebras with recursive data types. In: ICTAC, LNCS, vol. 5160, pp. 81–95. Springer (2008)
7. Blom, S., van de Pol, J., Weber, M.: LTSmin: distributed and symbolic reachability. In: CAV, LNCS, vol. 6174, pp. 354–359. Springer (2010)
8. Blumofe, R.D.: Scheduling multithreaded computations by work stealing. In: FOCS, pp. 356–368. IEEE Computer Society (1994)
9. Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. J. Parallel Distrib. Comput. 37(1), 55–69 (1996)
10. Brace, K.S., Rudell, R.L., Bryant, R.E.: Efficient implementation of a BDD package. In: DAC, pp. 40–45 (1990)
11. Bryant, R.E.: Graph-based algorithms for Boolean function manipulation. IEEE Trans. Comput. C-35(8), 677–691 (1986)
12. Burch, J., Clarke, E., Long, D., McMillan, K., Dill, D.: Symbolic model checking for sequential circuit verification. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 13(4), 401–424 (1994)
13. Cabodi, G., Gai, S., Sonza Reorda, M.: Boolean function manipulation on massively parallel computers. In: Proceedings of 4th Symposium on Frontiers of Massively Parallel Computation, pp. 508–509. IEEE (1992)
14. Chen, J., Banerjee, P.: Parallel construction algorithms for BDDs. In: ISCAS 1999, pp. 318–322. IEEE (1999)
15. Chung, M.Y., Ciardo, G.: Saturation NOW. In: QEST, pp. 272–281. IEEE Computer Society (2004)
16. Ciardo, G., Lüttgen, G., Siminiceanu, R.: Saturation: an efficient iteration strategy for symbolic state-space generation. In: TACAS, LNCS, vol. 2031, pp. 328–342 (2001)
17. Ciardo, G., Marmorstein, R.M., Siminiceanu, R.: Saturation unbound. In: TACAS 2003, pp. 379–393 (2003)
18. Ciardo, G., Zhao, Y., Jin, X.: Parallel symbolic state-space exploration is difficult, but what is the alternative? In: PDMC, pp. 1–17 (2009)
19. Clarke, E.M., McMillan, K.L., Zhao, X., Fujita, M., Yang, J.: Spectral transforms for large Boolean functions with applications to technology mapping. In: DAC, pp. 54–60 (1993)
20. Coudert, O., Madre, J.C.: A unified framework for the formal verification of sequential circuits. In: ICCAD 1990, pp. 126–129. IEEE Computer Society (1990)
21. van Dijk, T.: Sylvan: multi-core decision diagrams. Ph.D. thesis, University of Twente (2016)
22. van Dijk, T., Hahn, E.M., Jansen, D.N., Li, Y., Neele, T., Stoelinga, M., Turrini, A., Zhang, L.: A comparative study of BDD packages for probabilistic symbolic model checking. In: SETTA, LNCS, vol. 9409, pp. 35–51. Springer (2015)
23. van Dijk, T., Laarman, A., van de Pol, J.: Multi-core BDD operations for symbolic reachability. ENTCS 296, 127–143 (2013)
24. van Dijk, T., Laarman, A.W., van de Pol, J.: Multi-core and/or symbolic model checking. ECEASST 53 (2012)
25. van Dijk, T., van de Pol, J.: Lace: non-blocking split deque for work-stealing. In: MuCoCoS, LNCS, vol. 8806, pp. 206–217. Springer (2014)
26. van Dijk, T., van de Pol, J.: Sylvan: multi-core decision diagrams. In: TACAS, LNCS, vol. 9035, pp. 677–691. Springer (2015)
27. van Dijk, T., van de Pol, J.: Multi-core symbolic bisimulation minimisation. In: TACAS, LNCS, vol. 9636, pp. 332–348. Springer (2016)
28. Ezekiel, J., Lüttgen, G., Ciardo, G.: Parallelising symbolic state-space generators. In: CAV, LNCS, vol. 4590, pp. 268–280 (2007)
29. Faxén, K.: Efficient work stealing for fine grained parallelism. In: ICPP 2010, pp. 313–322. IEEE Computer Society (2010)
30. Faxén, K.F.: Wool—a work stealing library. SIGARCH Comput. Archit. News 36(5), 93–100 (2008)
31. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: PLDI, pp. 212–223. ACM (1998)
32. Gai, S., Rebaudengo, M., Sonza Reorda, M.: An improved data parallel algorithm for Boolean function manipulation using BDDs. In: Proceedings of Euromicro Workshop on Parallel and Distributed Processing, pp. 33–39. IEEE (1995)
33. Grumberg, O., Heyman, T., Schuster, A.: A work-efficient distributed algorithm for reachability analysis. Form. Methods Syst. Des. 29(2), 157–175 (2006)
34. Hahn, E.M., Li, Y., Schewe, S., Turrini, A., Zhang, L.: IscasMC: a web-based probabilistic model checker. In: FM, LNCS, vol. 8442, pp. 312–317. Springer (2014)
35. Heyman, T., Geist, D., Grumberg, O., Schuster, A.: Achieving scalability in parallel reachability analysis of very large circuits. In: CAV, LNCS, vol. 1855, pp. 20–35. Springer, Berlin/Heidelberg (2000)
36. IBM: IBM System/370, Principles of Operation. IBM Publication No. GA22-7000-4 (1975)
37. Kam, T., Villa, T., Brayton, R.K., Sangiovanni-Vincentelli, A.L.: Multi-valued decision diagrams: theory and applications. Mult. Valued Log. 4(1), 9–62 (1998)
38. Kant, G., Laarman, A., Meijer, J., van de Pol, J., Blom, S., van Dijk, T.: LTSmin: high-performance language-independent model checking. In: TACAS 2015, LNCS, vol. 9035, pp. 692–707. Springer (2015)
39. Kimura, S., Clarke, E.M.: A parallel algorithm for constructing binary decision diagrams. In: Proceedings of International Conference on Computer Design: VLSI in Computers and Processors ICCD, pp. 220–223 (1990)
40. Kimura, S., Igaki, T., Haneda, H.: Parallel binary decision diagram manipulation. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E75-A(10), 1255–1262 (1992)
41. Laarman, A., van de Pol, J., Weber, M.: Boosting multi-core reachability performance with shared hash tables. In: FMCAD 2010, pp. 247–255. IEEE (2010)
42. Laarman, A.W., van de Pol, J., Weber, M.: Multi-core LTSmin: marrying modularity and scalability. In: NFM 2011, LNCS, vol. 6617, pp. 506–511. Springer (2011)
43. Lovato, A., Macedonio, D., Spoto, F.: A thread-safe library for binary decision diagrams. In: SEFM, LNCS, vol. 8702, pp. 35–49. Springer (2014)
44. Meijer, J., Kant, G., Blom, S., van de Pol, J.: Read, write and copy dependencies for symbolic model checking. In: Yahav, E. (ed.) HVC, LNCS, vol. 8855, pp. 204–219. Springer (2014)
45. Miller, D.M., Drechsler, R.: On the construction of multiple-valued decision diagrams. In: ISMVL, pp. 245–253 (2002)
46. Milvang-Jensen, K., Hu, A.J.: BDDNOW: a parallel BDD package. In: FMCAD, pp. 501–507 (1998)
47. Ochi, H., Ishiura, N., Yajima, S.: Breadth-first manipulation of SBDD of Boolean functions for vector processing. In: DAC, pp. 413–416 (1991)
48. Oortwijn, W.: Distributed symbolic reachability analysis. Master's thesis, University of Twente, Dept. of C.S. (2015)
49. Ossowski, J.: JINC—a multi-threaded library for higher-order weighted decision diagram manipulation. Ph.D. thesis, Rheinische Friedrich-Wilhelms-Universität Bonn (2010)
50. Parasuram, Y., Stabler, E.P., Chin, S.K.: Parallel implementation of BDD algorithms using a distributed shared memory. In: HICSS, vol. 1, pp. 16–25 (1994)
51. Pelánek, R.: BEEM: benchmarks for explicit model checkers. In: SPIN, pp. 263–267. Springer-Verlag, Berlin, Heidelberg (2007)
52. Rudell, R.: Dynamic variable ordering for ordered binary decision diagrams. In: ICCAD, pp. 42–47 (1993)
53. Sanghavi, J.V., Ranjan, R.K., Brayton, R.K., Sangiovanni-Vincentelli, A.L.: High performance BDD package by exploiting memory hierarchy. In: DAC, pp. 635–640 (1996)
54. Sewell, P., Sarkar, S., Owens, S., Nardelli, F.Z., Myreen, M.O.: x86-TSO: a rigorous and usable programmer's model for x86 multiprocessors. Commun. ACM 53(7), 89–97 (2010)
55. Somenzi, F.: Efficient manipulation of decision diagrams. STTT 3(2), 171–181 (2001)
56. Somenzi, F.: CUDD: CU decision diagram package release 3.0.0. http://vlsi.colorado.edu/~fabio/CUDD/ (2015)
57. Stornetta, T., Brewer, F.: Implementation of an efficient parallel BDD package. In: DAC, pp. 641–644 (1996)
58. Velev, M.N., Gao, P.: Efficient parallel GPU algorithms for BDD manipulation. In: ASP-DAC, pp. 750–755. IEEE (2014)
59. Wimmer, R., Herbstritt, M., Hermanns, H., Strampp, K., Becker, B.: Sigref—a symbolic bisimulation tool box. In: ATVA, LNCS, vol. 4218, pp. 477–492. Springer (2006)
60. Yang, B., O'Hallaron, D.R.: Parallel breadth-first BDD construction. In: PPOPP, pp. 145–156 (1997)
Copyright information
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.