figure a

1 Introduction

Model checking is an exhaustive algorithm to verify that a finite model of a concurrent system satisfies certain temporal properties. The main challenge is to handle the large state space, resulting from the combination of parallel components. Symbolic model checking exploits regularities in the set of reachable states, by storing this set concisely in a decision diagram. In asynchronous systems, transitions have locality, i.e. they affect only a small part of the state vector. This locality is exploited in the saturation strategy, which is probably the most efficient strategy to compute the set of reachable states.

In this paper, we investigate the efficiency and speedup of a new parallel implementation of saturation, aiming at a multi-core, shared-memory implementation. The implementation is carried out in the parallel decision diagram framework Sylvan [16], in the language-independent model checker LTSmin [22]. We empirically evaluate the speedup of parallel saturation on Petri nets from the Model Checking Contest [24], running the algorithm on up to 48 cores.

1.1 Related Work

The saturation strategy has been developed and improved by Ciardo et al. We refer to [13] for an extensive description of the algorithm. Saturation derives its efficiency from firing all local transitions that apply at a certain level of the decision diagram, before proceeding to the next higher level. An important step in the development of the saturation algorithm allows on-the-fly generation of the transition relations, without knowing the cardinality of the state variable domains in advance [12]. This is essential to implement saturation in LTSmin, which is based on the PINS interface to discover transitions on-the-fly.

Since saturation obtains its efficiency from a restrictive firing order, it seems inherently sequential. Yet the problem of parallelising saturation has been studied intensively. The first attempt, Saturation NOW [9], used a network of PCs. This version could exploit the collective memory of all PCs, but due to the sequential procedure, no speedup was achieved. By firing local transitions speculatively (but with care to avoid memory waste), some speedup has been achieved [10]. More relevant to our work is the parallelisation of saturation for a shared memory architecture [20]. The authors used CILK to schedule parallel work originating from firing multiple transitions at the same level. They reported some speedup on a dual-core machine, at the expense of a serious memory increase. Their method also required to precompute the transition relation. An improvement of the parallel synchronisation mechanism was provided in [31]. They reported a parallel speedup of 2\(\times \) on 4 CPUs. Moreover, their implementation supports learning the transition relation on-the-fly. Still, the successful parallelisation of saturation remained widely open, as indicated by Ciardo [14]: “Parallel symbolic state-space exploration is difficult, but what is the alternative?”

For an extensive overview of parallel decision diagrams on various hardware architectures, see [15]. Here we mention some other approaches to parallel symbolic model checking, different from saturation for reachability analysis. First, Grumberg and her team [21] designed a parallel BDD package based on vertical partitioning. Each worker maintains its own sub-BDD. Workers exchange BDD nodes over the network. They reported some speedup on 32 PCs for BDD based model checking under the BFS strategy. The Sylvan [16] multi-core decision diagram package supports symbolic on-the-fly reachability analysis, as well as bisimulation minimisation [17]. Oortwijn [28] experimented with a heterogeneous distributed/multi-core architecture, by porting Sylvan’s architecture to RDMA over MPI, running symbolic reachability on 480 cores spread over 32 PCs and reporting speedups of BFS symbolic reachability up to 50. Finally, we mention some applications of saturation beyond reachability, such as model checking CTL [32] and detecting strongly connected components to detect fair cycles [33].

1.2 Contribution

Here we show that implementing saturation on top of the multi-core decision diagram framework Sylvan [16] yields a considerable speedup in a shared-memory setting of up to 32.5\(\times \) on 48 cores with pre-learned transition relations, and 52.2\(\times \) with on-the-fly transition learning.

By design decision, our implementation reuses several features provided by Sylvan, such as: its own fine-grained, work-stealing framework Lace [18], its implementation of both BDDs (Binary Decision Diagrams) and LDDs (a List-implementation of Multiway Decision Diagrams), its concurrent unique table and operations cache, and finally, its parallel operations like set union and relational product. As a consequence, the pseudocode of the algorithm and additional code for saturation is quite small, and orthogonal to other BDD features. To improve orthogonality with the existing decision diagrams, we deviated from the standard presentation of saturation [13]: we never update BDD nodes in situ, and we eliminated the mutual recursion between saturation and the BDD operations for relational product to fire transitions.

The implementation is available in the open-source high-performance model checking tool LTSmin  [22], with its language-agnostic interface, Partitioned Next-State Interface (PINS) [5, 22, 25]. Here, a specification basically provides a next-state function equipped with dependency information, from which LTSmin can derive locality information. We fully support the flexible method of learning the transition relation on-the-fly during saturation [12]. As a consequence, our contribution extends the tool LTSmin with saturation for various specification languages, like Promela, DVE, Petri nets, mCRL2, and languages supported by the ProB model checker. See Sect. 4 on how to use saturation in LTSmin.

The experiments with saturation in Sylvan are carried out in LTSmin as well. We used Petri nets from the MCC competition. Our experimental design has been carefully set up in order to facilitate fair comparisons. Besides learning the transition relation on-the-fly, we also pre-learned them in order to measure the overhead of learning, and eliminating its effect in comparisons. It is well known that the variable ordering has a large effect on the BDD sizes [29]. Hence, our experiments are based on two of the best static variable orderings known, Sloan [26] and Force [1]. In particular, our experiments measure and compare:

  • The performance of our parallel algorithm with one worker, compared to a state-of-the art sequential implementation of saturation in Meddly [4].

  • The parallel speedup of our algorithm on 16 cores, and for specific examples up to 48 cores.

  • The efficiency and speedup of saturation compared to the BFS and chaining strategies for reachability analysis.

  • The effect of choosing Binary Decision Diagrams or List Decision Diagrams.

  • The effect of choosing Sloan or Force to compute static variable orders.

2 Preliminaries

This paper proposes an algorithm for decision diagrams to perform the fixed point application of multiple transition relations according to the saturation strategy, combined with on-the-fly transition learning as implemented in LTSmin. We briefly review these concepts in the following.

2.1 Partitioned Transition Systems

A transition system (TS) is a tuple \((S, \rightarrow , s^0)\), where S is a set of states, \(\rightarrow \subseteq S \times S\) is a transition relation and \(s^0 \in S\) is the initial state. We define \(\rightarrow ^*\) to be the reflexive and transitive closure of \(\rightarrow \). The set of reachable states is \(R = \{s \in S \mid s^0 \rightarrow ^* s\}\). The goal of this work is to compute R via a novel multi-core saturation strategy.

Fig. 1.
figure 1

Example Petri net

In this paper, we evaluate multi-core saturation using Petri nets. Figure 1 shows an example of a (safe) Petri net. We show its initial marking, which is the initial state. A Petri net transition can fire if there is a token in each of its source places. On firing, these tokens are consumed and tokens in each target place are generated. For example, \(t_1\) will produce one token in both \(p_2\) and \(p_5\), if there is a token in \(p_4\). Transition \(t_6\) requires a token in both \(p_3\) and \(p_1\) to fire. The markings of this Petri net form the states of the corresponding TS, so here \(\vert S \vert = 2^5 = 32\). From the initial marking shown, four more markings are reachable, connected by 10 enabled transition firings. This means \(\vert R \vert = 5\), and \(\vert {\rightarrow }\vert = 10\).

Notice that transitions in Petri nets are quite local; transitions consume from, and produce into relatively few places. The firing of a Petri net transition is called an event and the number of involved places is known as the degree of event locality. This notion is easily defined for other asynchronous specification languages and can be computed by a simple control flow graph analysis.

To exploit event locality, saturation requires a disjunctive partitioning of the transition relation \(\rightarrow \), giving rise to a Partitioned Transition System (PTS). In a PTS, states are vectors of length N, and \(\rightarrow \) is partitioned as a union of M transition groups. A natural way to partition a Petri net is by viewing each transition as a transition group. For Fig. 1 this means we have \(N=5\) and \(M=6\). After disjunctive partitioning, each transition group depends on very few entries of the state vector. This allows for efficiently computing the reachable state space for the large class of asynchronous specification languages. LTSmin supports commonly used specification languages, like DVE, mCRL2, Promela, PNML for Petri nets, and languages supported by ProB.

Fig. 2.
figure 2

LDD for \(\{\left\langle 0,0 \right\rangle \!,\!\left\langle 0,2 \right\rangle \!,\!\left\langle 0,4 \right\rangle \!,\!\left\langle 1,0 \right\rangle \!,\!\left\langle 1,2 \right\rangle \!,\!\left\langle 1,4 \right\rangle \!,\!\left\langle 3,2 \right\rangle \!,\!\left\langle 3,4 \right\rangle \!,\!\left\langle 5,0 \right\rangle \!,\!\left\langle 5,1 \right\rangle \!,\!\left\langle 6,1 \right\rangle \}\).

2.2 Decision Diagrams

Binary decision diagrams (BDDs) are a concise and canonical representation of Boolean functions \(\mathbb {B}^N\rightarrow \mathbb {B}\) [7]. A BDD is a rooted directed acyclic graph with leaves 0 and 1. Each internal node v has a variable label \(x_i\), denoted by \(\mathrm {var}(v)\), and two outgoing edges labeled 0 and 1, denoted by \(\mathrm {low}(v)\) and \(\mathrm {high}(v)\). The efficiency of reduced, ordered BDDs is achieved by minimizing the structure with some invariants: The BDD may neither contain equivalent nodes, with the same \(\mathrm {var}(v)\), \(\mathrm {low}(v)\) and \(\mathrm {high}(v)\), nor redundant nodes, with \(\mathrm {low}(v)=high(v)\). Also, the variables must occur according to a fixed ordering along each path.

Multi-valued or multiway decision diagrams (MDDs) generalize BDDs to finite domains (\(\mathbb {N}^N\rightarrow \mathbb {B}\)). Each internal MDD node with variable \(x_i\) now has \(n_i\) outgoing edges, labeled 0 to \(n_i-1\). We use quasi-reduced MDDs with sparse nodes. In the sparse representation, values with edges to leaf 0 are skipped from MDD nodes, so outgoing edges must be explicitly labeled with remaining domain values. Contrary to BDDs, MDDs are usually “quasi-reduced”, meaning that variables are never skipped. In that case, the variable \(x_i\) can be derived from the depth of the MDD, so it is not stored.

A variation of MDDs are list decision diagrams (LDDs) [5, 16], where sparse MDD nodes are represented as a linked list. See Fig. 2 for two visual representations of the same LDD. Each LDD node contains a value, a “down” edge for the corresponding child, and a “right” edge pointing to the next element in the list. Each list ends with the leaf 0 and each path from the root downwards ends with the leaf 1. The values in an LDD are strictly ordered, i.e., the values must increase to the “right”.

LDD nodes have the advantage that common suffixes can be shared: The MDD for Fig. 2a requires two more nodes, one for [2, 4] and one for [1], because edges can only point to an entire MDD node. LDDs suffer from an increased memory footprint and inferior memory locality, but their memory management is simpler, since each LDD node has a fixed small size.

Fig. 3.
figure 3

Dependency matrices of Fig. 1.

2.3 Variable Orders and Event Locality

Good variable orders are crucial for efficient operations on decision diagrams. The syntactic variable order from the specification is often inadequate for the saturation algorithm to perform well. Hence, finding a good variable order is necessary. Variable reordering algorithms use heuristics based on event locality. The locality of events can be illustrated with dependency matrices. The size of those matrices is \(M \times N\), where M is the number of transition groups, and N is the length of the state vector. The order of columns in dependency matrices determines the order of variables in the DD. Figure 3a shows the natural order on places in Fig. 1. A measure of event locality is called event span [29]. Lower event span is correlated to a lower number of nodes in decision diagrams. This can be seen in LDDs in Figs. 4a and b that are ordered according to columns in Figs. 3a and b respectively.

Fig. 4.
figure 4

Reachable states as LDDs with different orders on places

Event span is defined as the sum over all rows of the distance from the leftmost non-zero column to the rightmost non-zero column. The event span of Fig. 3a is 22 (\(= 4 + 2 + 2 + 5 + 5 + 4\)); the event span of Fig. 3b is 16, which is better. Optimizing the event span and thus variable order of DDs is NP-complete [6], yet there are heuristic approaches that run in subquadratic time and provide good enough orders. Commonly used algorithms are Noack [27], Force [1] and Sloan [30]. Noack creates a permutation of variables by iteratively minimizing some objective function. The Force algorithm acts as if there are springs in between nonzeros in the dependency matrix, and tries to minimize the average tension among them. Sloan tries to minimize the profile of matrices. In short, profile is the symmetric counterpart to event span. For a more detailed overview of these algorithms see [3]. In our empirical evaluation we use both Sloan and Force, because these have been shown to give the best results [2, 26].

2.4 The Saturation Strategy

The saturation strategy for reachability analysis, i.e., the transitive closure of transition relations applied to some set of states, was first proposed by Ciardo et al. See for an overview [11, 13]. Saturation was combined with on-the-fly transition learning in [12]. Besides reachability, saturation has also been applied to CTL model checking [32] and in checking fairness constraints with strongly connected components [33].

Saturation is well-studied. The core idea is to always fire enabled transitions at the lower levels in the decision diagram, before proceeding to the next level. This tends to keep the intermediate BDD sizes much smaller than for instance the breadth-first exploration strategy. This is in particular the case for asynchronous systems, where transitions exhibit locality. There is also a major influence from the variable reordering: if the variables involved in a transition are grouped together, then this transition only affects adjacent levels in the decision diagram.

We refer to [13] for a precise description of saturation. Our implementation deviates from the standard presentation in three ways. First, we implemented saturation for LDDs and BDDs, instead of MDDs. Next, we never update nodes in the LDD forest in situ; instead, we always create new nodes. Finally, the standard representation has a mutual recursion between saturation and firing transitions. Instead, we fire transition using the existing function for relational product, which is called from our saturation algorithm. As a consequence, the extension with saturation becomes more orthogonal to the specific decision diagram implementation. We refer to Sect. 3 for a detailed description of our algorithm. We show in Sect. 5 that these design decisions do not introduce computational overhead.

3 Multi-core Saturation Algorithm

To access the three elements of an LDD node x, Sylvan [16] provides the functions \(\texttt {value}(x)\), \(\texttt {down}(x)\), \(\texttt {right}(x)\). To create or retrieve a unique LDD node using the hash table, Sylvan provides \(\texttt {LookupLDDNode}(\textit{value},\textit{down},\textit{right})\).

Furthermore, Sylvan provides several operations on LDDs that we use to implement reachability algorithms, such as \(\texttt {union}(A,B)\) to compute the set union \(A\cup B\) and \(\texttt {minus}(A,B)\) to compute the set difference \(A\setminus B\). For transition relations, Sylvan provides an operation \(\texttt {relprod}(S,R)\) to compute the successors of S with transition relation R, and an operation \(\texttt {relprodunion}(S,R)\) that computes \(\texttt {union}(S,\texttt {relprod}(S,R))\), i.e., computing the successors and adding them to the given set of states, in one operation. All these operations are internally parallelized, as described in [16].

figure b

We implement multi-core saturation as in Algorithm 1. We have a transition relation disjunctively partitioned into M relations \(R_0\dots R_{M-1}\). These relations are sorted by the level (depth) of the decision diagram where they are applied, which is the first level touched by the relation. We say that relation \(R_i\) is applied at depth \(d_i\). We identify the current next relation with a number k, \(0\le k\le M\), where \(k=M\) denotes “no next relation”. Decision diagram levels are sequentially numbered with 0 for the root level.

The saturate algorithm is given the initial set of states S and the initial next transition relation \(k=0\) and the initial decision diagram level \(d=0\). The algorithm is a straightforward implementation of saturation. First we check the easy cases where we reach either the end of an LDD list, where \(S=0\), or the bottom of the decision diagram, where \(S=1\). If there are no more transition relations to apply, then \(k=M\) and we can simply return S. When we arrive at line 4, the operation is not trivial and we consult the operation cache.

If the result of this operation was not already in the cache, then we check whether we have relations at the current level. Since the relations are sorted by the level where they must be applied, we compare the current level d with the level \(d_k\) of the next relation k. If we have relations at the current level, then we perform the fixed point computation where we first saturate S for the remaining relations, starting at relation \(k'\), which is the first relation that must be applied on a deeper level than d, and then apply the relations of the current level, that is, all \(R_i\) where \(k\le i<k'\). If no relations match the current level, then we compute in parallel the results of the suboperations for the LDD of successor “right” and for the LDD of successor “down”. After obtaining these sub results, we use LookupLDDNode to compute the final result for this LDD node. Finally, we store this result in the operation cache and return it.

The do in parallel keyword is implemented with the work-stealing framework Lace [18], which is embedded in Sylvan [16] and offers the primitives spawn and sync to create subtasks and wait for their completion. The implementation using spawn and sync of lines 12–14 is as follows.

figure c

The implementation of multi-core saturation for BDDs is identical, except that we parallelize on the “then” and “else” successors of a BDD node, instead of on the “down” and “right” successors of an LDD node.

To add on-the-fly transition relation learning to this algorithm, we simply modify the loop at line 9 as follows:

figure d

The learn-transitions function provided by LTSmin updates relation i given a set of states S. The function first restricts S to so-called short states \(S^i\), which is the projection of S on the state variables that are touched by relation i. Then it calls the next-state function of the PINS interface for each new short state and it updates \(R_i\) with the new transitions.

Updating transition relations from multiple threads is not completely trivial. LTSmin solves this using lock-free programming with the compare-and-swap operation. After collecting all new transitions, LTSmin computes the union with the known transitions and uses compare-and-swap to update the global relation; if this fails, the union is repeated with the new known transitions.

4 Contributed Tools

We present several new tools and extensions to existing tools produced in this work. The new tools support experiments and comparisons between various DD formats. The extension to Sylvan and LTSmin provides end-users with multi-core saturation for reachability analysis.

4.1 Tools for Experimental Purposes

For the empirical evaluation, we need to isolate the reachability analysis of a given LDD (or BDD or MDD). To that end, we implemented three small tools that only compute the set of reachable states, namely for LDDs, for BDDs and for MDDs using the library Meddly. These tools are given an input file representing the model, compute the set of reachable states, and report the number of states and the required time to compute all reachable states. Additionally we provide the tools and that convert an LDD file to a BDD file and to an MDD file. The LDD input files are generated using LTSmin (see below). These tools can all be found onlineFootnote 1.

4.2 Tools for On-The-Fly Multi-core Saturation

On-the-fly multi-core saturation is implemented in the LTSmin  toolset, which can be found onlineFootnote 2. The examples in this section are also onlineFootnote 3. On-the-fly multi-core saturation for Petri nets is available in LTSmin’s tool . This tool computes all reachable markings with parallel saturation. The command line to run it on Fig. 1 is . The tool reports: . Additionally, it appears the final LDD has 16 nodes.

Here the syntactic variable order of the places in is used. To use a better variable order, the option is added to the command line. For instance adding runs Force, while runs Sloan’s algorithm (as implemented in the well-known Boost library). Running reports that the final LDD has only 12 nodes.

The naming convention of LTSmin’s binaries follows the Partitioned Next-State Interface (PINS) architecture [5, 22, 25]. PINS forms a bridge between several language front-ends and algorithmic back-ends. Consequently, besides , LTSmin also provides and several other combinations. These binaries generate the state space for the languages PNML, DVE and Promela, by means of distributed explicit-state, multi-core explicit-state and multi-core symbolic algorithms, respectively. Additionally, LTSmin supports checking for deadlocks and invariants, and verifying LTL properties and \(\mu \)-calculus formulas. In this work we focus on state space generation with the symbolic back-end only.

We now demonstrate multi-core saturation for Promela models. Consider the file which is an implementation of the GARP protocol [23]. To compute the reachable state space with the proposed algorithm and Force order, run: . On a consumer laptop with 8 hardware threads, LTSmin reports 385,000,995,634 reachable states within 1 min. To run the example with a single worker, run . On the same laptop, the algorithm runs in 4 min with 1 worker. We thus have a speedup of 4\(\times \) with 8 workers for symbolic saturation on a Promela model.

5 Empirical Evaluation

Our goal with the empirical study is five-fold. First, we compare our parallel implementation with only 1 core to the purely sequential implementation of the MDD library Meddly [4], in order to determine whether our implementation is competitive with the state-of-the-art. Second, we study parallel scalability up to 16 cores for all models and up to 48 cores with a small selection of models. Third, we compare parallel saturation with LDDs to parallel saturation with ordinary BDDs, to see if we get similar results with BDDs. Fourth, we compare parallel saturation without on-the-fly transition learning to on-the-fly parallel saturation, to see the effects of on-the-fly transition learning on the performance of the algorithm. Fifth, we compare parallel saturation with other reachability strategies, namely chaining and BFS, to confirm whether saturation is indeed a better strategy than chaining and BFS.

To perform this evaluation, we use the P/T Petri net benchmarks obtained from the Model Checking Contest 2016 [24]. These are 491 models in total, stored in PNML files. We use parallel on-the-fly saturation (in LTSmin) with a generous timeout of 1 hour to obtain LDD files of the models, using the Force variable ordering and using the Sloan variable ordering. In total, 413 of potentially 982 LDD files were generated. These LDD files simply store the list decision diagrams of the initial states and of all transition relations. We convert the LDD files to BDD files (binary decision diagrams) with an optimal number of binary variables. We also convert the LDD files to MDD files for the experiments using Meddly. This ensures that all solvers have the same input model with the same variable order.

Table 1. The six solving methods that we use in the empirical evaluation. Five methods are parallelized and one method is on-the-fly.
Table 2. Number of benchmarks (out of 413) solved within 20 min with each method with the given number of workers.

See Table 1 for the list of solving methods. As described in Sect. 4, we implement the tools lddmc, bddmc and medmc to isolate reachability computation for the purposes of this comparison, using respectively the LDDs and BDDs of Sylvan and the MDDs of Meddly. The on-the-fly parallel saturation using LDDs is performed with the pnml2lts-sym tool of LTSmin. We use the command line pnml2lts-sym ORDER --lace-workers=WORKERS --saturation=sat FILE, where ORDER is -rf for Force and -rbs for Sloan and WORKERS is a number from the set \(\{1,2,4,8,16\}\).

All experimental scripts, input files and log files are available online (see footnote 3). The experiments are performed on a cluster of Dell PowerEdge M610 servers with two Xeon E5520 processors and 24 GB internal memory each. The tools are compiled with gcc 5.4.0 on Ubuntu 16.04. The experiments for up to 48 cores are performed on a single computer with 4 AMD Opteron 6168 processors with 12 cores each and 128 GB internal memory.

When reporting on parallel executions, we use the number of workers for how many hardware threads (cores) were used.

Overview. After running all experiments, we obtain the results for 413 models in total, of which 196 models with the Force variable ordering and 217 models with the Sloan variable ordering. In the remainder of this section, we study these 413 benchmarks. See Table 2, which shows the number of models for which each method could compute the set of reachable states within 20 min.

Table 3. Cumulative time and parallel speedups for each method-#workers combination on the models where all methods solved the model in time. These are 301 models in total: 151 models with Force, 150 models with Sloan.

To correctly compare all runtimes, we restrict the set of models to those where all methods finish within 20 min with any number of workers. We retain in total 301 models where no solver hit the timeout. See Table 3 for the cumulative times for each method and number of workers and the parallel speedup. Notice that this is the speedup for the entire set of 301 models and not for individual models.

Comparing LDD saturation with Meddly’s saturation. We evaluate how ldd-sat with just 1 worker compares to the sequential saturation of Meddly. The goal is not to directly measure whether there is a parallel overhead from using parallelism in Sylvan, as the algorithm in lddmc is fundamentally different because it uses LDDs instead of MDDs and the algorithm does not in-place saturate nodes, as also explained in Sect. 3. The low parallel overheads of Sylvan are already demonstrated elsewhere [15, 16, 18]. Rather, the goal is to see how our version of saturation compares to the state-of-the-art.

Table 2 shows that Meddly’s implementation (mdd-sat) and our implementation (ldd-sat 1) are quite similar in the number of solved models. Meddly solves 375 benchmarks and our implementation solves 388 within 20 min.

See Table 3 for a comparison of runtimes. Meddly solves the 150 models with Sloan almost 2\(\times \) as fast as our implementation in Sylvan, but is slower than our implementation for the 151 models with Force. We observe for individual models that the difference between the two solvers is within an order of magnitude for most models, although there are some exceptions. Our implementation quickly overtakes Meddly with additional workers.

Table 4. Parallel speedup for a selection of benchmarks on the 48-core machine (only top 5 shown)

Parallel Scalability. As shown in Table 3, using 16 workers, we obtain a modest parallel speedup for saturation of 6.2\(\times \) (with Sloan) and 4.7\(\times \) (with Force). On individual models, the differences are large. The average speedup of the individual benchmarks is only 1.8\(\times \) with 16 workers, but there are many slowdowns for models that take less than a second with 1 worker. We take an arbitrary selection of models with a high parallel speedup and run these on a dedicated 48-core machine. Table 4 shows that even up to 48 cores, parallel speedup keeps improving. We even see a speedup of 52.2\(\times \). For this superlinear speedup we have two possible explanations. One is that there is some nondeterminism inherent in any parallel computation; another is already noted in [20] and is related to the “chaining” in saturation, see further [20].

Comparing LDD saturation with BDD saturation. As Table 3 shows, the ldd-sat and bdd-sat method have a similar performance and similar parallel speedups.

On-the-fly LDD saturation. Comparing the performance of offline saturation with on-the-fly saturation, we observe the same scalability with the Sloan variable order, but on-the-fly saturation requires roughly 2\(\times \) as much time. With the Force variable order, on-the-fly saturation is slower but has a higher parallel speedup of 7.9\(\times \).

Comparing saturation, chaining and BFS. We also compare the saturation algorithm with other popular strategies to compute the set of reachable states, namely standard (parallelized) BFS and chaining, given in Fig. 5. As Tables 2 and 3 show, chaining is significantly faster than BFS and saturation is again significantly faster than chaining. In terms of parallel scalability, we see that parallelized BFS scales better than the others, because it can already parallelize in the main loop by computing successors for all relations in parallel, which chaining and saturation cannot do. For the entire set of benchmarks, saturation is the superior method, however there are individual differences and for some models, saturation is not the fastest method.

Fig. 5.
figure 5

Algorithms bfs and chaining implement the Parallel BFS and Chaining strategies for reachability.

6 Conclusion

We presented a multi-core implementation of saturation for the efficient computation of the set of reachable states. Based on Sylvan’s multi-core decision diagram framework, the design of the saturation algorithm is mostly orthogonal to the type of decision diagram. We showed the implementation for BDDs and LDDs; the translation relation can be learned on-the-fly. The functionality is accessible through the LTSmin high-performance model checker. This makes parallel saturation available for a whole collection of asynchronous specification languages. We demonstrated multi-core saturation for Promela and for Petri nets in PNML representation.

We carried out extensive experiments on a benchmark of Petri nets from the Model Checking Contest. The total speedup of on-the-fly saturation is 5.9\(\times \) on 16 cores with the Sloan variable ordering and 7.9\(\times \) with the Force variable ordering. However, there are many small models (computed in less than a second) in this benchmark. For some larger models we showed an impressive 52\(\times \) speedup on a 48-core machine. From our measurements, we further conclude that the efficiency and parallel speedup for the BDD variant is just as good as the speedup for LDDs. We compared efficiency and speedup of saturation versus other popular exploration strategies, BFS and chaining. As expected, saturation is significantly faster than chaining, which is faster than BFS; this trend is maintained in the parallel setting. Our measurements show that the variable ordering (Sloan versus Force), and the model representation (pre-computed transition relations versus learned on-the-fly) do have an impact on efficiency and speedup. Parallel speedup should not come at the price of reduced efficiency. To this end, we compared our parallel saturation algorithm for one worker to saturation in Meddly. Meddly solves fewer models within the timeout, but is slightly faster in other cases, but parallel saturation quickly overtakes Meddly with multiple workers.

Future work could include the study of parallel saturation on exciting new BDD types, like tagged BDDs and chained BDDs [8, 19]. The results on tagged BDDs showed a significant speedup compared to ordinary BDDs on experiments in LTSmin with the BEEM benchmark database. Another direction would be to investigate the efficiency and speedup of parallel saturation in other applications, like CTL model checking, SCC decomposition, and bisimulation reduction.