Efficient GPU algorithms for parallel decomposition of graphs into strongly connected and maximal end components

This article presents parallel algorithms for component decomposition of graph structures on general purpose graphics processing units (GPUs). In particular, we consider the problem of decomposing sparse graphs into strongly connected components, and decomposing graphs induced by stochastic games (such as Markov decision processes) into maximal end components. These problems are key ingredients of many (probabilistic) model-checking algorithms. We explain the main rationales behind our GPU-algorithms, and show a significant speed-up over the sequential (as well as existing parallel) counterparts in several case studies.


Introduction
Strongly connected components (SCCs, for short) are sub-graphs in which each pair of states is mutually reachable.Finding maximal SCCs, i.e., SCCs that are not contained in others, is a key ingredient of various model-checking algorithms.To mention a few, this applies to the standard verification algorithms for CTL-formulas of the form EG ϕ and for verifying fair CTL [5,Chap.6]where so-called fair SCCs play an important role.Checking language emptiness [32] is based on SCCs, and efficient model-checking algorithms for discrete-time Markov chains exploit SCC decomposition [4].The high relevance of SCCs has led to various dedicated variants of Tarjan's classical algorithm [30] such as a symbolic variant [9] and a plethora of parallel algorithms [1,23,26].In the context of probabilistic model checking, a generalisation of SCCs-known as maximal end components (MECs)-play a pivotal role [2,17].Determining MECs is a main step in the verification of qualitative and quantitative properties on Markov decision processes (MDPs) and continuous-time variants thereof.MDPs are an important class of models used for the analysis of probabilistic systems consisting of several components running in parallel.Parallelism is modelled by non-determinism whereas the steps within a component may be probabilistic (e.g., modelling a coin flip).MDP model checking is a very active branch of probabilistic model checking with applications in amongst others planning and randomised distributed algorithms.MECs are maximal strongly connected sub-graphs in which the MDP can ensure to reside when playing against a probabilistic adversary.MEC decomposition of MDPs is a pre-processing step of probabilistic model checking to determining almost-sure limiting properties [5,Chap. 10] such as almost-sure repeated reachability and limiting Rabin acceptance conditions.They are also relevant for checking ω-regular properties on MDPs under fairness constraints [5,Chap. 10].Other applications of MEC decomposition include the analysis of multi-player stochastic games [31], recent approaches to combined worst-case and expected value objectives for mean pay-off games [12], as well as incremental verification techniques for MDPs [24].Algorithms for MEC decomposition are still an active area of research.Improvements of the traditional sequential algorithms for determining MECs [2,5,17] have been reported [13] and were tailored to MDPs with low tree-width [16].Recently, also a first dynamic algorithm has been given that maintains the MEC decomposition of a graph under a sequence of edge insertions or deletions [15].
In this article, we provide new algorithms to efficiently decompose graphs into SCCs and MECs by exploiting GPUs (Graphics Processing Units).Our decomposition algorithms build upon three key principles.First, inspired by the Forward-Backward algorithm (FB) [20], each thread combines a forward and a backward reachability search so as to identify SCCs.Previous work on GPU-based SCC decomposition [1,23,26] identified the FB algorithm (combined with a trimming procedure to remove trivial SCCs) as the best performing one for general input graphs.Opposed to these works, we focus on graphs that are commonly observed in model checking, i.e., sparse graphs with a low average out-degree (number of outgoing transitions per state) and tailor our algorithms to treat these graphs efficiently.The backward and forward search are started from some common state, called the pivot.
The second main principle is to exploit a novel pivot selection strategy which turns out to be simple and efficient.It enforces race conditions between threads to pseudo-randomly select pivots.Compared to our earlier work [36], this strategy has been further optimised in two ways: first of all, we use so-called warp-aggregation to reduce the number of racing threads, and second of all, when used for MEC decomposition, an SCC decomposition procedure following a MEC decomposition iteration can exploit the results obtained so far (as inspired by [13]) to further restrict the number of racing threads.
Finally, we optimise the memory management to achieve coalesced memory access by the individual threads, i.e., data access can be accomplished in a single memory fetch.Altogether this alleviates memory latency and thread divergence where part of the threads execute one branch of the common code, while others take another branch.
The overall memory requirements are significantly lower than for competitive algorithms [1] as besides the input graph G = (V, E), only a single additional integer array of size |V | is needed to store decomposition results.Given the restricted memory size on a GPU, this memory reduction is essential.Our GPU-based MEC decomposition algorithm uses the same principles as the SCC algorithm; it can be viewed as a parallel version of the standard sequential algorithms [2,5,17].To the best of our knowledge, this is the first GPU-based MEC decomposition procedure.We implemented our algorithms using CUDA 1for NVIDIA GPUs, and ran them on examples of the PRISM benchmark suite [25].Speed-up factors of 15-30 and 79 have been achieved for SCC and MEC decomposition, respectively.For SCC decomposition, this is a significant improvement over previous results (e.g.[1]) for sparse graphs with a low average out-degree.
Exploiting general purpose GPUs (GPGPUs, for short) in the setting of model checking is not new.Thanks to efforts of several research groups [6,10,19], GPGPUs have been applied to significantly improve the run times of model checking algorithms.In the context of probabilistic model checking, these improvements usually targeted the numerical part of the algorithms, so as to exploit the inherent advantages of the GPUs [10,11,34].More recently, we presented an on-the-fly search algorithm for standard model checking running entirely on GPUs [35], and how to perform strong and branching bisimilarity checking on GPUs [33].
This article is based on the conference version [36], extending it in the following ways: 1.In the preliminary section: (a) We provide the proofs of all theoretical results.(b) A running example has been added to illustrate the presented notions.(c) Besides the FB algorithm with trimming, we also discuss an alternative version proposed by Bloem et al. [8], which we refer to as BFBT, that offers an advantage over standard FB in terms of overall complexity, but also a disadvantage regarding parallelism.Like FB, we also consider SCC and MEC decomposition based on BFBT.(d) A section has been added in which we reason how the basic algorithms can be adapted for parallel execution.
2. Multiple pseudo-code descriptions have been added, and the accompanying text explains in more detail how the SCC and MEC decomposition procedures work.3. We propose two new optimisations regarding the so-called pivot selection procedure, which is a key step in our GPU decomposition algorithms.4. Finally, the experimental section has been extended to validate the new amendments of the pivot selection and the use of BFBT.
Organisation of the article.Section 2 treats the basics of MDPs, MECs and relevant SCC and MEC decomposition algorithms.Section 3 gives an introduction to the typical architecture of a GPU and the various important aspects and notions relevant for GPU programming and understanding the article.In Sect.4, we discuss related work, specifically focussing on how to perform Breadth-First Search (BFS) efficiently on GPUs.Then, in Sect.5, the various GPU procedures that we have developed to efficiently perform SCC decomposition on GPUs are presented, followed by a discussion in Sect.6 how this approach can be extended to achieve MEC decomposition.Finally, Sect.7 presents the experimental results, and Sect.8 concludes.

Preliminaries
This section gives an introduction to the main concepts of MDPs and MECs [5,Chap.10],presents the parallel FB algorithm for SCC decomposition [20] and the standard sequential algorithm for MEC decomposition [2,17] of MDPs.

Markov decision processes and maximal-end components
Let Δ(X ) denote the set of probability distributions over the countable set X , i.e., the set of functions μ : X → [0, 1] with x∈X μ(x) = 1.
Definition 1 (Markov decision process) A Markov decision process (MDP) is a tuple M = (S, ŝ, T ), where S is a finite set of states, ŝ ∈ S is the initial state, and T : S → 2 Δ(S) is the transition function with T (s) = ∅ and T (s) is finite for all s ∈ S.
The transition function T maps every state s ∈ S to a finite, non-empty set of distributions over S. In state s, one of the distributions in μ ∈ T (s) is selected non-deterministically, and the MDP evolves to state s with probability μ(s ).As T (s) is non-empty for every state, this procedure can be repeated ad infinitum.For state s, T (s) can be viewed as the set of distributions that are selected in a non-deterministic manner.Alternatively, an MDP can be considered as a single-player game in which the system plays against a random adversary.
An MDP naturally induces a digraph in the following sense.
Example 1 Consider the MDP depicted in Fig. 1 with S = {s 0 , . . ., s 7 }, ŝ = s 0 and e.g., Intuitively speaking, there is a μ-labelled edge between two vertices (states) u and v whenever v is in the support of distribution μ in T (u).For node u and distribution μ, We call E μ (u) the set of target vertices (states) of the source vertex (state) u under distribution μ.Moreover, let E(u) = μ E μ (u).For labelled digraphs we adopt the standard graph-theoretical notions like paths, cycles, components, etc..An MDP graph G = (V, E) is strongly connected iff for every two vertices u, v ∈ V there is a path from u to v and a path from v to u.The set of nodes C ⊆ V is a strongly connected component (SCC) Fig. 1 Example MDP.The corresponding induced graph is obtained by omitting the numbers behind the actions in the transition labels In the sequel, unless stated otherwise, we use the abbreviation SCC for maximal SCCs.In the following, let G = (V, E) be an MDP graph.
It is convenient to distinguish vertices that are potentially "closed" in the sense that for at least one non-deterministic choice (distribution) all transitions remain within a given set.For the description of the MDP algorithms (below) we define the notion of attractor.Stated in words, an attractor is a set of vertices in which the MDP may reside with positive probability no matter which distributions are non-deterministically selected.
Definition 7 (Attractor) The attractor Attr(U ) of U ⊆ V is defined as Attr(U ) = i≥0 U i where U i is defined inductively by: -U 0 = U , and The attractor Attr(U ) contains U plus all vertices from which (the vertices in) U can be reached via at least one transition regardless of the resolution of the non-deterministic choices by the adversary.
Example 3 Consider again the running example MDP.It follows that {s 4 , s 5 , s 6 } is a MEC, as well as {s 2 } and {s 3 }.These are the only MECs in the running example.For U = {s 3 } we have Attr(U ) = {s 1 , s 3 }.
The MEC-decomposition algorithm discussed later on exploits the following two results from [14].(The formulation of the lemmata and the corresponding proofs are adapted to our definition of MDP graphs.)The first result identifies some of the vertices that do not belong to any MEC and thus can be removed without affecting the MEC decomposition of the rest of the MDP graph.
Since X and C share nodes and are both strongly connected, it follows that they belong to the same SCC, which because of the maximality requirement is C itself, hence X ⊆ C. Now it suffices to show that X ∩ Attr(U ) = ∅.This is done by contraposition.Using the structure of the attractor definition (Definition 7) we show by induction that X ∩ U i = ∅, for all i.The base case for U 0 = U holds because X ∩ U = ∅ would imply that X is not e-closed.For the inductive step, suppose that, for some k, X ∩ U k = ∅ and X ∩ U k+1 = ∅.Let us consider a node u ∈ X ∩ U k+1 .By Definition 7, we have ∀μ.E μ (u)∩U k = ∅.But this is impossible since it contradicts the induction hypothesis.Therefore, X ∩ U k+1 = ∅, which establishes the inductive step of the proof.2. In case X = C the claim follows directly from Z = Attr(C)\C.So, we assume X = C.The proof follows a similar inductive pattern like for the first item.We show that each set U i in Definition 7 is disjoint from X .For the base step U 0 = C, since X = C and X and C as SCCs are by definition disjoint, we have X ∩ U 0 = ∅.Suppose that, for some k, X ∩ U k = ∅ and X ∩ U k+1 = ∅.As above, using Definition 7 we conclude that for some u ∈ X ∩ U k+1 ∀μ.E μ (u) ∩ U i = ∅, which contradicts the fact that X , being a MEC, must be e-closed.The second result from [14] provides a sufficient criterion for an SCC to be a MEC.Lemma 2 below formally establishes the fact that every bottom SCC, i.e., an SCC C such that all transitions from C lead back to C, is a MEC.
Proof From the premise of the lemma we have that C is closed and consequently it is also e-closed.Since C is also an SCC, it follows that C is a MEC.

SCC decomposition using forward-backward search
Many algorithms exist to perform SCC decomposition.Linear-time algorithms such as the ones by Tarjan [30] and Dijkstra [18] are based on depth-first search and thus very hard to parallelize, especially when the goal is to run thousands of threads in parallel as is the case with GPUs.An alternative for SCC decomposition is the Forward-Backward algorithm (FB, for short) proposed by Fleischer et al. [20].A similar approach was proposed by Xie and Beerel for symbolic model checking, but without noting the potential for parallel execution [37].This algorithm is based on a breadth-first search (BFS) strategy, combining a forward and a backward search.It has worst-case complexity O(|V | 2 + |V | • |E|), but offers great potential for GPU-based parallelization.
The Forward-Backward algorithm of Fleischer et al. is presented in Algorithm 1, with two modifications: first of all, a so-called trimming procedure has been added at line 1, which is discussed later in this section.Because of this, we refer to this algorithm from now on as FBT.Second of all, besides a graph G = (V, E), it also takes as input a candidate set of vertices J ⊆ V .The algorithm starts by (randomly) selecting a pivot vertex p (see Algorithm 1, line 3) from J .The SCC to which p belongs is then found by performing both a forward BFS and a backward BFS starting from p, to determine the forward and backward closure (of p), respectively (Algorithm 1, lines 4 and 5).The intersection of the vertices reached via the forward and backward BFSs constitutes an SCC (and is removed, Algorithm 1, line 6).The graph vertices are then partitioned into the vertices belonging only to the forward closure, those only in the backward closure, and those outside both closures.These subsets are referred to as search regions.Subsequently, FBT can be invoked recursively in parallel on the three search regions.This can be done, since all other, not yet detected SCCs, are contained in one of these search regions.
A necessary condition for the correctness of the FBT algorithm is that set J does not become ∅ as long as at least one of the generated search regions is not empty, i.e., at least one recursive call of FBT can be made in lines 8-10 with a non-empty vertex set.Initially the algorithm is called with J = V 0 , where V 0 is the set of vertices in the initial graph.The non-emptyness condition can be trivially fulfilled by setting for each recursive call J = V , where V is the set of vertices of graph G on which the FBT algorithm is applied, i.e., the input graph.Later, in the context of the MEC algorithm (Sect.2.4), we present an alternative choice for J , which in fact is the motivation for us introducing a candidate set to FBT in the first place.
As previously mentioned, the FBT algorithm involves a trimming step [27] (see Algorithm 1, line 1).This step eliminates the trivial SCCs consisting of a single vertex.The trimming procedure exploits topological sort elimination by starting in a vertex with zero in-or out-degree.As such vertex cannot be a part of a non-trivial SCC, they can be safely removed to avoid using them as pivots in the FBT search.Since the removal can create other trimming candidates, the procedure is iterated (in the method Trim(V ) in Algorithm 1) until there are no vertices for trimming left.Trimming is also used in our parallel SCC algorithm.Several studies [1,23,26] have shown that parallel SCC decomposition algorithms including Coloring heads off [29] and Recursive OBF [7], show inferior performance compared to the FBT algorithm.
Bloem et al. [8] presented an optimisation of Forward-Backward BFS as presented by Xie & Beerel.The optimisation takes into account that whenever either the forward or backward BFS has finished, but the other has not, then the latter can be restricted to those states that have been visited by the former.A version of FBT with this optimisation is described in Algorithm 2, and we refer to it as Bounding FBT (BFBT).In this version of the algorithm, we refer with Ffront, Bfront to the search frontier of the forward and backward BFS, respectively.Functions fwdBfsIteration and bwdBfsIteration perform one iteration of the forward and backward BFS, respectively, by moving the respective frontier to neighbouring states.Whenever one search has finished (condition of while loop at lines 6-8), the set Converged will contain all the states that were reached in the search that finished (lines 9-12).Next, either lines 13-14 or lines 15 and 16 will be executed, depending on which search still has work to do.Note that these searches are bounded to the area that has been searched by the search that terminated (conditions at lines 13 and 15).
BFBT has a clear advantage when the input graph has a structure where an FBT search would typically finish either one of the searches much sooner than the other.On the other hand, lines 19 and 20 show that there is less potential for increasing the number of independent searches; in BFBT, due to the fact that one of the searches did not run its course, we can only launch up to two new BFBT instances in newly discovered regions, as opposed to three in FBT (Algorithm 1, lines 8-10).In Sect.7, we report on experimental results using both FBT and BFBT, which makes it possible to draw some conclusions regarding their performance.

Sequential MEC decomposition algorithms
The basic sequential algorithm for MEC decomposition of MDP graph G = (V, E) is based on an iterative SCC decomposition of G followed by transforming the SCCs into MECs [2,5,17].The algorithm consists of the following stages: Lemma 2).As justified by Lemma 1.2, remove Attr(C) for every C for which we established that C is a MEC.
3. Recursively compute the MEC-decompositions of the sub-MDP graphs obtained after the removal of the vertices in steps 2 and 3. (This is needed since the removal of the vertices might have destroyed the strong connectivity of some of the components.) The first step of the algorithm, i.e., the SCC decomposition of the MDP graph, can be done in O(m) time, where m = |E| is the number of edges, e.g., using, e.g., Tarjan's algorithm [30].The second step can be done in O(m) time.There are at most n = |V | iterations implied by step 3, since in each iteration at least one vertex is removed.This yields an overall time complexity of O(m • n).Recent works [14] and [16] present an adapted MEC-decomposition algorithm with time complexity where k is the so-called tree width of G.We base our GPU algorithm on the basic algorithm, since the recent algorithms involve steps that seem very hard to perform within the many-core paradigm of GPUs, like the lock-step search phase of [14].

Towards a parallel algorithm for MEC decomposition
In the parallel version of the MEC algorithm we would like for SCC decomposition to use (a version of) Algorithm 1. Previously, it was assumed that J = V .However, on a GPU, it is particularly hard to let all the threads together make some random decision.This is explained in detail in Sect.5.3.In fact, the smaller the number of threads that need to make such a decision, the better.Since on the GPU, threads will be mapped to states on a one-to-one basis, this means that it could be advantageous to select an alternative candidate set J ⊆ V which is still non-empty, but as small as possible.
To achieve this, we use a modification of the (basic) MEC algorithm which is inspired by the sequential algorithm for MEC decomposition from [13].Consider a recursive call of the MEC decomposition algorithm applied to graph G = (V, E).Let G = (V , E ) be a region, i.e., a subgraph of G obtained after the removal of the SCC attractors in Step 2 of the MEC algorithm and to which we are about to apply a recursive MEC (SCC) decomposition in Step 4. Let L denote the set of nodes that have been removed in the last (most recent) iteration of the algorithm in Step 2 of the MEC algorithm, after the SCC decomposition in Step 1.Let J be the "join" set of all nodes u such that E(u) ∩ L = ∅.
For the correctness of the algorithm, we observe that the set J has to remain non-empty until all MECs of the input graph (V, E) are found.If in Step 2, L = ∅, then all found SCCs are MECs (since set U of vertices with outgoing transitions which are not e-closed is empty) and we are done.If L = ∅ then, since each vertex v ∈ L is originally part of a (non-trivial) SCC C, there must be an edge (u, v) with u ∈ C \ L and consequently u ∈ J .
However, set J does not have to be defined in such a way that it is guaranteed that each SCC decomposition procedure is complete; not all SCCs need to be identified by the decomposition procedure, and in fact, this is not guaranteed by J as it is defined above.
The modified version of the MEC algorithm that exploits the new definition of J is as follows, with initially J = V : (1) Compute FBT(V, E, J ).
(2) For each SCC C: Lemma 2).As justified by Lemma 1.2, remove Attr(C) for every C for which we established that C is a MEC.
(3) Compute sets L of the nodes that were removed in Step 2 and assign to J the set of the vertices that have an edge to a node in L. Recursively compute the MEC-decompositions of the sub-MDP graphs obtained after the removal of the vertices in steps 2 and 3. (This is needed since the removal of the vertices might have destroyed the strong connectivity of some of the components.) The main motivation behind using a smaller subset of V (V ) for J as defined above is to possibly accelerate the pivot finding in the SCC decomposition step.We revisit this issue later in Sect.5.3.Finally, it has to be noted that since J as defined above does not guarantee that the SCC decomposition steps produce complete results, using J may lead to fewer search regions being discovered in search iterations, and therefore to less potential for parallelism.We report on our findings regarding this in Sect.7.

GPU basics
Harnessing the power of GPUs is facilitated by specific Application Programming Interfaces.In this article, we assume a concrete NVIDIA GPU architecture and the Compute Unified Device Architecture (CUDA) interface.Nevertheless, the algorithms that we present here can be straightforwardly applied to any architecture which provides massive hardware multithreading, supports the SIMT (Single Instruction Multiple Threads) model, and relies on coalesced access to the memory.
CUDA is an interface by NVIDIA which is used to program GPUs.CUDA extends C and Fortran.We use the C extension.GPU-specific features of CUDA include special declarations to explicitly place variables in the various types of memory (see Fig. 2), predefined keywords containing the IDs of individual threads and blocks of threads, synchronization statements for cooperation between threads, run time API for memory management (allocation, deallocation), and statements to launch functions, referred to as kernels, on a GPU.In this section we give a brief overview of CUDA, adequate for presenting our results in subsequent sections.More details can be found in, for instance, [10,35].

CUDA programming model
A CUDA program consists of a host program which runs on the Central Processing Unit (CPU) and (a collection of) CUDA kernels.Kernels, which describe the parallel parts of the program, are executed many times in parallel by different threads on the GPU device, and are launched from the host.Most GPUs have the restriction that at most one kernel can be launched at a time, but there are also GPUs available that allow to run multiple different kernels on different threads.When launching a kernel, the number of threads that should execute it needs to be specified.All those threads execute the same kernel, i.e. code.Since CUDA 5.0, dynamic parallelism is supported, meaning that besides the host launching

BlockId
The unique ID of a block

ThreadId
The unique ID of a thread w.r.t.its block (between 0 and BlockSize − 1)

Global − ThreadId
The globally unique ID of a thread (between 0 and NrOfThreads − 1)

Lane
The unique ID of a thread w.r.t.its warp (between 0 and 31) kernels, also GPU threads can do this; in all cases, though, the number of threads that need to execute the kernel needs to be specified.A thread is executed by a streaming processor (SP), see Fig. 2. In general, GPU threads are grouped in blocks of a predefined size, usually a power of two.We refer to this size with BlockSize.A block of threads is assigned to a multiprocessor.Each thread block is uniquely identifiable by its block ID (referred to with the keyword BlockId) and analogously each thread is uniquely identifiable by its thread ID (keyword ThreadId) within its block.Using these, it is possible to define other IDs, such as the GPU-global thread ID Global − ThreadId = (BlockId • BlockSize) + ThreadId.The total number of threads running is defined by NrOfThreads.For clarity, an explanation of these and additional terms is given in Table 1.

CUDA memory model
Threads have access to different kinds of memory.Each thread has its own on-chip registers, access to which is very fast.Moreover, threads within a block can communicate via the shared memory of a multiprocessor, which is on-chip and also very fast.If multiple blocks are executed in parallel then the shared memory is equally split between them.All blocks have access to the global memory which is relatively large (usually up to 5 GB), but slow, since it is off-chip.Two caches called L1 and L2 are used to cache data read from the global memory.The host has read and write access to the global memory.Thus, the global memory is used for communication between the host and the kernel.

GPU architecture
As already mentioned, the architecture of a GPU features a set of streaming multiprocessors (SMs).Each of those contains a set of SPs.The NVIDIA Kepler K20m, which we used for our experiments, has 13 SMs, each consisting of 192 SPs, which gives in total 2496 SPs.Furthermore, it has 5 GB global memory.

CUDA execution model
Threads are executed using the SIMT model.This means that each thread is executed independently with its own instruction address and local state (registers and local memory), but their execution is organized in groups of 32 called warps (see Table 1).The threads in a warp execute instructions in a synchronous manner, meaning that they move through the code in lock-step.This limits the possibilities for data races, but it also means that so-called divergence of thread executions can negatively impact performance of the computation.Consider the if-then-else construct if C then A else B. If the threads in a warp start executing this, and there are both threads for which C holds and threads for which it does not, then all the threads will together step through both alternatives A and B. The ones that do not need to execute A (or B) will have to 'go along' due to the SIMT model, but they will not actually execute it.Avoiding thread divergence is one of the main worries when implementing a program for the GPU.
Similarly, memory accesses of the threads in a single warp are serialized when they need to access separate parts of the global memory.If these accesses can be grouped together physically, i.e. if the accesses are coalesced, then the data can be obtained using a single fetch, thereby greatly improving the runtime.Hence, global memory access should be coalesced as much as possible.This is orthogonal to the fact that in graph decomposition algorithms, accessing transitions is irregular.Thus, achieving coalesced access is non-trivial.For sparse graphs, we propose a technique to reduce irregular memory access later in this section.

Related GPU implementations
Sparse graphs are usually stored in a format based on the Compressed Sparse Row (CSR) format.An integer array trans of size |E| is used to store all the transitions, in order of the source state IDs, and an array offsets consisting of |V | + 1 integers provides the start and end indices of the outgoing transitions of each source state, e.g. for state i, its outgoing transitions are stored in trans from position offsets[i] up to and including offsets[i + 1] − 1.This encoding differs from CSR only in the fact that the CSR format is originally used to store sparse matrices, in which the row indices of the non-zero elements are also required, and are stored in a third array.In our case, this is not needed, but for clarity, we still refer to the format as CSR.The CSR representation of the graph induced by the MDP in Fig. 1, without the actions and probabilities, is given in Fig. 3.
The usual approach to perform a BFS-like search through a CSR description on a GPU involves the threads repeatedly scanning the offsets array using their ID, as in [22].Such a GPU based algorithm is given in Algorithm 3. Note that GPU specific notions such as NrOfThreads have been defined in Sect.3. Two of the three highest bits in the offsets entries indicate whether the corresponding state is 1) in the search frontier or not and 2) has been explored or not.
Variable stepsize in line 1 of Algorithm 3 is related to input restructuring, i.e., an optimized version of the graph representation via trans, discussed in Sect.5.1.Throughout this version 123 Fig. 3 Example Compressed Sparse Row format graph storage.The transitions in trans are determined by their destination states.The arrays encode the graph induced by the MDP in Fig. 1 without the actions and probabilities.One can see, for instance, that state s 0 is a source of transitions to s 1 , s 2 , and s 4 , and from s 1 there is one transition to s 1 , and there are two transitions to s 2 and two transitions to s 3 of Algorithm 3 we assume stepsize = 1.In order to check the status of state i, the source state of the transitions we are going to explore, we copy offsets[i] to srcinfo (line 3).We check if state i belongs to the frontier (line 4) by inspecting the highest bit of the variable srcinfo.If state i is in the frontier it is marked as explored by resetting the highest bit and setting the second highest bit of offsets [i].After that all transitions of state i are explored.To this end first the offset interval corresponding to the transitions of i is established in lines 6 and 7.After that all transitions are inspected to possibly generate new frontier states (lines 8-14).It is checked at line 10 whether transition t has the special value empty.This is related to the optimisation described in Sect.5.1, and can be ignored for now.The target state tgtstate of the inspected transition is extracted from t in line 11 and a copy of the offsets entry for tgtstate is saved in tgtinfo.In line 13 it is checked if the target state is new, i.e., it has not been visited yet.If this is the case, it is added to the frontier by setting the highest bit of the corresponding offsets entry.Note that any possible occurring data races due to multiple threads reaching the same successor state simultaneously can be considered benign; every thread executing line 14 tries to update offsets[tgtstate] in the same way, namely by setting the highest bit.
Such an approach to BFS requires many complete scans of offsets to detect the current frontier and explore states.Since global memory is slow, this is a major performance bottleneck.
Li et al. [26] remark that a GPU BFS which avoids a one-to-one mapping between threads and nodes is preferable over the standard quadratic approach.In other words, approaches like the one of Merrill et al. [28], which uses a work queue, would be preferable.An important reason is that many threads otherwise idle, and with large differences in the out-degree of nodes, work imbalance tends to occur.With sparse matrices such as those underlying MDPs, however, this is not a big concern.The out-degree of most states tends to be similar, and small.In fact, in [3], an implementation of Merrill's approach does not result in further speedups for model checking problems, but it does require more memory.Therefore, we opt for the standard approach to do BFS on a GPU.
Pivot selection is an important step in SCC decomposition, which is non-trivial to implement efficiently on a GPU, since all threads need to agree on the pivots used for the newly discovered regions before launching new BFSs, and the regions need to be distinguishable by means of unique IDs.Several elaborate schemes for this have been presented.In [1], an additional array of size |V | is used, and all threads assigned to states in regions that need to be searched try to write their ID to a common entry in this array.Determining which entry should be targeted is done using a region counting scheme and renumbering heuristics.Also in [26], such an array is used, but instead of racing to entries, a random number generator is implemented, state IDs are written to designated entries, and a prefix sum is used to count the number of new regions.Finally, Hong et al. [23] maintain set representations while doing the forward and backward BFSs, and use these to select pivots.We claim that our solution, which we explain in this section, is more elegant than earlier attempts, and at least as efficient.Instead of essentially trying to use a region counter, we simply use the pivot IDs themselves to identify regions, and our procedure requires no additional memory, instead using the results and trans arrays.
In addition to our new pivot selection, we also contribute compared to earlier work by using SM local caching of states, and restructuring the input to increase the number of coalesced memory accesses.Finally, we merge the frontier and explored set representations with the graph representation, thereby being more economic with the memory, and avoiding memory lookups.

Data representation
For the encoding of a transition, first of all note that for our problems, the probability distributions in MDP graphs are not relevant, only 1) the target states, and 2) the distribution group a transition belongs to.In our implementations, we desire to work with 32-bit integers, as opposed to 64-bit integers, since it allows more efficient use of the global memory on our GPU.Hence, we assume that for each transition, an encoding of the group and the target state together fits in a 32-bit integer.Our program actually checks this: first, it is determined for the input what the maximum number of groups per source state is, say m.Then, the log(m) highest bits of each transition integer are reserved for the group encoding.
To produce the desired output, i.e. the SCC decomposition, we allocate memory for another integer array results of size |V |.After decomposition, its content indicates which states belong to which SCC. Any two states i, j belong to the same SCC iff results[i] = results [ j].
Besides the original input, when memory allows, we also store the transposed MDP graph on the GPU.Since the representation tailored for a (forward) BFS, the transposed graph will be for a backward BFS.If there is not enough memory, then a kernel is available for scanning offsets and trans to perform a backward BFS, which is possible, but requires more memory accesses.
Finally, for bookkeeping purposes, we reserve the three highest bits in each entry of offsets and results.One bit of each entry i is used to indicate that state i is no longer involved in the current search iteration, i.e. it is already identified as part of a component.The two remaining bits in offsets and results entries are used to keep track of the search frontier and the set of explored states in the forward and the backward BFSs, respectively.We reason that this is acceptable: with this restriction, it is still possible to refer to 2 29 states, i.e. about 537 million states.For a graph to be decomposed by our GPU implementation, 2 • |V | + + integers are needed if the transposed graph is not stored, and 3 The vast majority of currently available GPUs have at most 5 GB global memory, which allows up to 1.3 billion integers to be stored, hence 29 bits is sufficient on those GPUs to refer to all the states of a graph that can be handled.In the future, when larger amounts of global memory is readily available (for instance, the NVIDIA Kepler K40 already has 12 GB memory), one can still switch from 32-bit to 64-bit integers to use our implementation for larger input graphs.

Restructuring input for coalesced memory access
In a BFS iteration, offsets are read in a coalesced way by the fact that the threads in a warp, with consecutive IDs, access an uninterrupted range of offsets.For the transitions in trans, though, this is a different matter, which is illustrated on the left in Fig. 4. For the sake of clarity, we assume in this example that the warp size is 3, and we consider three states s 0 to s 2 , and their outgoing transitions.In the figure, transition t 00 is the first outgoing transition of state s 0 , t 10 is the first one of state s 1 , and so on.Since the transitions are stored in separate blocks in trans, it is clear that access to trans will not be coalesced.
To fix this, we interleave the transition entries such that for all the states assigned to a warp, their first transitions are stored in an uninterrupted block, followed by all the second transitions, and so on.This allows to fetch transitions in a coalesced way.The drawback of this is that padding might be required to ensure that each thread accesses the same number of entries.On the right of Fig. 4, the interleaved version of the example is given.We call a block of transitions ordered in this fashion which is assigned to a warp a segment.The amount of padding required is calculated per segment, so the locally maximum number of outgoing transitions determines the padding.The end of a segment is indicated in offsets at the first position of the next segment.
To avoid extensive padding, though, we use a hybrid representation.For a user-defined out-degree upper-bound u, which we call the segment interval, all the states with at most u outgoing transitions are renumbered to appear in the first part of offsets and trans, and all the other states are placed at the tail end.In the corresponding first part of trans, restructuring is applied, but on the tail part it is not.This allows to avoid that states with unusually many transitions cause the introduction of too many padding entries in the segment they are assigned to.For example, a hybrid representation with u = 2 of the example in Fig. 4 is given in Fig. 5.In that case, the amount of padding is reduced from three entries, as on the right in Fig. 4, to only one entry.The drawback, though, is that accessing the transitions in the hybrid repre-Fig.4 Fetching transitions before and after restructuring 123 Fig. 5 Hybrid structure with u = 2, combining interleaved and non-interleaved storage sentation tends to require more separate data accesses.In the example, the transitions in the fully interleaved list can be read in three accesses, while the hybrid list requires five accesses.
In Algorithm 3, restructured transitions are supported when setting stepsize to the warp size.Working with a hybrid representation would involve checking whether the transitions of a state are stored in a segment or not before exploring them, and setting stepsize appropriately.

GPU search mechanism
To illustrate our implementation of FBT for GPUs, we will discuss some of its more interesting aspects.Essentially, every step of Algorithm 1 is parallelised by means of a separate kernel.In addition to this, we also have a kernel for the combination of lines 4 and 5, i.e. the BFSs.In this hybrid kernel, iterations of both BFSs are performed simultaneously during a single scan of the offsets.
Algorithm 4 describes the GPU forward BFS.A local cache is allocated in shared memory.The size of this cache is defined in the host code, i.e. externally, as its declaration mentions.Its contents is initialised as empty.At lines 3-7, the offsets entries assigned to the executing thread are read and checked.
The approach to BFS as given in Algorithm 3 requires many complete scans of offsets to detect the current frontier and explore states.Since global memory is slow, this is a major performance bottleneck.To mitigate this, we have opted for using SM local state caches residing in the shared memory.The gpu-fwdBfs kernel accepts a given number of iterations NrIters.In the first iteration (lines 3-7), the usual scanning is performed, but in addition to being added to the frontier in the global memory, newly discovered states are added to the cache.After the first iteration, lines 8-13 are executed, in which the cache is scanned for exploration work.
In Algorithm 5, the explore procedure is described, which is in the implementation actually directly integrated with gpu-fwdBfs.First, stepsize is defined depending on whether 123 the transitions belonging to state i to be explored reside in a segment or not.At line 4, srcregion stores the FBT search region (see Sect. 2.2) to which state i belongs.Starting at line 7, the successors of i are read.At line 12, isActive checks if the search region of the target state of transition t has already been identified as an SCC in a previous round.This is the case if both the second and highest bits of results[ j] are set.If it is not part of a detected SCC, and both the source and target state of t are part of the same search region (line 14), where tgtregion represents the search region of the target state (line 13), then the target state is eligible for addition to the frontier.If its offsets entry indicates that the state is newly discovered, then, depending on the current search iteration, the target state is or is not added to the local cache (lines [16][17][18][19][20][21][22].Besides this, nextIter is set, which is read by the host after each search iteration to determine whether another iteration is required.Also, the target state is added to the search frontier (line 22).Finally, in the final iteration, no states are added to the cache, since after the final iteration, kernel execution will stop anyway, and the contents of the shared memory does not survive once a kernel has terminated.
Similar to gpu-fwdBfs, we also have a backward BFS variant operating on the transposed graph, if present, and a backward BFS variant operating on the original graph, which works different from Algorithm 5, since it involves in each iteration checking that from a state, the current frontier can be reached.Keeping track of the contents of the frontier and the set of explored states is done by using the bookkeeping bits in results.Besides this, we have a hybrid approach, in which both an iteration of the forward BFS and the backward BFS is performed.All these different versions allow to manage at the host level which searches should be performed in the next iteration, based on the feedback given by the threads.

BFBT
Inspired by the optimisation described in BFBT (Algorithm 2) regarding the bounding of one BFS once the other has terminated, we implemented a similar optimisation, which differs from the original one in that it refers to the global state of the SCC decomposition as a whole, as opposed to individual BFBT searches.This global version can be implemented straightforwardly by changing the condition for exploration of individual states in the appropriate procedures.For instance, in the forward BFS, we add a condition that a state may only be explored if it has already been explored in the backward BFS.This condition would be added to lines 5 and 14 in Algorithm 4. In the backward BFS code, we add a similar condition regarding the exploration history of the forward BFS.The hybrid search is left unchanged.These procedures can then be used as follows: we repeatedly apply the hybrid procedure on all the states in the graph, until there is no state from which we can continue either the forward or backward BFS.Depending on which BFS was terminated, we then repeatedly launch the procedure continuing the other BFS, which now is bound to the area that was explored by the BFS that terminated.Since the procedures are applied globally on all states, and therefore may involve multiple independent forward-backward searches being performed in parallel, the bounding of searches is only done once no search can continue the forward (or backward) BFS.This approach does not allow once search to be bounded, while another search running in parallel is not.

Pivot selection
Finally, the other main challenge is in selecting pivots.After merging the results of the forward and backward BFS in the bookkeeping bits of results, we resolve this by hashing the current regions of states to locations in trans.In other words, we are effectively temporarily reusing the trans array as a hash table for pivot selection.Algorithm 6 without the boxed code presents our pivot selection procedure.Note that state i belongs to search region results[i], which is stored in srcinfo at line 2. At line 6, it is determined whether state i is eligible for becoming pivot.This is the case if i is both active, and not reached in both the forward and backward BFS of the previous FBT iteration.Here, reachedinBwd and reachedinFwd indicate whether the state has been reached in the backward or forward BFS, respectively.If i is eligible, a hash h is computed for it at line 17 (leader is currently not relevant, we return to this later).This hash is computed as )) mod |E|.Since this location may actually be beyond the bounds of trans, pivot selection is performed in several iterations, i.e., the procedure described in Algorithm 6 is invoked several times, each time with an incremented value iter.In each iteration, only the regions with a hash between iter • |E| and (iter + 1) • |E| are considered.Once a thread has determined the hash h, it will try to 'claim' the corresponding trans[h] entry by atomically writing the ID of its state with the highest bit set to lock the entry (lines [24][25].The atomic compare-and-swap operation (atomicCAS) takes three arguments, namely the address where the new value must be written, the expected value at the address, and the new value.It only writes the new value if the expected value is encountered, and returns the encountered value, therefore a successful write has happened if t has been returned (line 26).Exactly one thread i will be able to do this, after which that thread will store the original trans[h] entry temporarily in results[i] (line 28).All other competing threads encounter a locked trans[h] entry either at line 22 or line 26, and write this new pivot information into their results entries at line 37.The enforced data races using atomicCAS are used to pseudo-randomly choose pivots.
Finally, to revert trans back to its original content, after pivot selection, thread i needs to swap results[i] and the unlocked trans[h].This is not listed in Algorithm 6, since it can only be done after a global thread synchronisation, and therefore must be done in a different kernel.Note that with this approach, SCCs are actually identified by their pivots, and any number of pivots can be selected in parallel.

Optimisation 1: Warp-aggregated pivot selection
Enforcing data races is an elegant way to pseudo-randomly select pivots, but requiring that many threads simultaneously perform atomic operations on the global memory can potentially result in a performance bottleneck.This is because atomic operations are scheduled to be performed in sequence, and hence reduce the level of parallelism.
To mitigate this effect, we use so-called warp-aggregation to ensure that within a warp, at most one thread per new search region will attempt to make its state a pivot.Warp-aggregation is achieved in general by the fact that communication among threads in a warp can be achieved instantaneously via their local registers.The boxed code in Algorithm 6 lists our extension to the pivot selection procedure to achieve this warp-aggregation.At lines 8-14, leaders are elected for each search region considered by threads in the warp.First, at line 8, the instruction ballot(1) produces a 32-bit bitmask indicating for each of the 32 threads in the warp whether or not they are active.Among these, the smallest warp ID or Lane (Table 1) j belonging to an active thread is selected using ffs().In fact, ffs() returns j + 1, 0 indicating that no thread is active.Hence, leader election continues until there are no more active threads (line 9).
At line 10, the shfl(v, i) instruction is used, which returns the value of variable v of thread i.So in this case, the new search region information is broadcasted from leader j − 1 to all other threads in the warp.All threads considering the same search region choose j − 1 as their leader at lines 11 and 12, and subsequently break out of the while-loop, effectively 123 disabling themselves until all threads in the warp have broken out.Finally, at line 14, the next leader is identified.
Once all leaders have participated in pivot selection (line 15), results need to be propagated back to all other threads in the warp.This is done at lines 41-48 in a way similar to how the leaders are elected.Note that if the leader broadcasts an empty value (line 45 checks for this), then the leader was not able to participate in pivot selection yet, and must try again in the next pivot selection iteration (iter must be increased).

Optimisation 2: Using set J
Another approach to try to optimise the pivot selection procedure is by reducing the number of states that are eligible for becoming pivot.Any such reduction has to ensure, however, that there are states eligible for becoming pivot as long as there is still work remaining, i.e., there are still SCCs or MECs to be discovered.In Sect.2.2, we refer to this as the non-emptiness condition.
For MEC decomposition, we suggest to use the definition of J given in Sect.2.4.There, it was argued that that definition satisfies the non-emptiness condition.The sets J can be used to reduce the number of pivot candidates in the SCC decomposition step of one MEC decomposition iteration based on the results of the previous MEC decomposition iteration.In other words, this means that this proposed optimisation, in contrast to the previous one, is not applicable for 1) stand-alone SCC decomposition, 2) the very first SCC decomposition step performed for MEC decomposition.
The optimisation can be added to Algorithm 6 by extending the condition checked at line 6 with the condition that state i must be in J .Membership in J can be maintained by using one of the bookkeeping bits for this purpose (in fact, in the implementation, there is one bit still available for this purpose in the offsets entries).What remains is to update the set J after every MEC iteration.Initially, all states are in J .Once the sets U and the attractor sets have been computed (see Sect. 6), one additional scan of the states needs to be performed to identify the states that have at least one transition leading to a state set for removal.

MEC decomposition on the GPU
Our GPU implementation for MEC decomposition is based on the algorithm presented in Sect.2.4.For step 1, we use our GPU SCC decomposition.For step 2, we first reset the second and third highest bookkeeping bits in results to reuse them as follows: one bit is used to indicate that a state should be removed, and the other bit is used to mark newly discovered MECs.First, a single scan of the input suffices to identify the sets U of the various SCCs.Pseudo-code for this is presented in Algorithm 7. Before this code is executed, for each search region with pivot j, entry trans[ j] is locked to indicate that no state has yet been found in that region belonging to U .Each state i active in the search but not in U (line 3) is inspected to determine whether in each outgoing transition group there is at least one transition going out of the SCC containing i.This is done at lines 12-23, using r and found to indicate whether a group has been found that does not contain a transition leaving the SCC, and whether such a transition has been found for the currently explored group, respectively.Finally, at lines 26-27, state i is added to U if the aforementioned condition holds.Whenever a state i in the SCC with ID srcregion is identified to be in U , we unlock entry trans [srcregion] to indicate that there are states in the SCC that are also in U , and hence the SCC cannot be a MEC (line 26).
Having detected the U , we compute the intersections of the attractor sets of the U and the SCCs that they belong to; states in those sets are marked for removal.In step 3, results is scanned and all entries with region srcregion and trans[srcregion] unlocked are marked as being in a MEC.Subsequently, we repeatedly compute the attractor sets of those MECs and mark the entries for removal.Concluding, in a single scan, locked trans entries are unlocked, to be removed results entries are set to the empty value (and their offsets entry is locked), and discovered MECs are locked as well.
It is important to note that SCCs discovered in a MEC decomposition iteration must necessarily be subsets of SCCs discovered in the previous iteration.This means that we can reuse earlier results to select multiple pivots at the start of an iteration, thereby starting multiple FBT searches in parallel.

Experiments
We conducted experiments to measure the performance of our implementations using a representative set of benchmark models taken from the standard distribution of the PRISM model checker and additional models provided through its dedicated website. 4In fact, we have selected all available MDP models that were scalable to interesting proportions while not requiring more memory than our GPU could handle, and were accepted by the latest version of PRISM.All experiments were performed on machines running CentOS Linux, with an Intel E5-2620 2.0 GHz CPU, 64 GB RAM, and an NVIDIA Kepler K20m GPU.This GPU has 2,496 cores and 5 GB global memory.It should be noted that given the range of features of CUDA that we use, the implementation requires that the NVIDIA GPU on which it is executed has at least computation capability 3.0.
For all GPU experiments, we launched |V |/512 blocks of 512 threads each, i.e. one thread per state.This keeps the amount of work per thread minimal, and does not introduce idle threads that keep the scheduler busy.
Table 2 presents the graph characteristics of the cases.The 'av. out' column provides the average out-degree, the 'max.out' column the maximum out-degree, and the '#SCCs' column displays the number of SCCs in the graph.Most graphs have a very particular structure; several consist practically entirely of trivial SCCs, and others are a single SCC.We have not preselected any models, so it is interesting to note this phenomenon.It merits further study whether most MDP problems boil down to MDP graphs of one of these types.
Figure 6 presents our SCC decomposition runtime results comparing a single-core CPU implementation of Tarjan's SCC decomposition algorithm with the best configuration of our GPU FBT implementation.FBT0,7 represents GPU FBT with 0 search iterations per BFS kernel launch using the local cache, i.e., the cache is not used, and 7 being the interval (out-degree upperbound) used for restructuring the input.
For graphs consisting of only one SCC, speedups of around 30 times can be observed.This is not surprising, since these can be analysed in a single GPU search iteration.When there are many trivial SCCs present, the trimming procedure is very influential.The efficiency of the trimming procedure is bound by the average out-degree of a graph; the more connected a trivial SCC state is to other states, the more potential there is for detecting other trivial SCCs in the next trimming iteration.For this reason, the diningcrypt case can be decomposed quickly compared to the zeroconf and firewire cases.
In Fig. 7, the results for various configurations of our GPU implementation are presented.Concerning zeroconf, firewire, and cases with many non-trivial SCCs, it can be observed that the input restructuring works very well (F0,7).In most cases, speedups of about 15 times can be observed.This is significant when considering that in related work [1], only speedups up to 5-6 times were measured for graphs representing model checking problems.Besides the restructuring, the new pivot selection procedure and the data representation likely also play a role in the improved speedup, but it is hard to determine how much, since these are core aspects of our implementation that we cannot easily disable.An experimental comparison with the work of [1] seems useful, however their implementation cannot handle graphs of similar size, due to the fact that eight bits are used per integer for bookkeeping, whereas runtime (sec.)models FBT0,1 FBT0,7 FBT3,1 FBT3,7 FBT0,1-nh Fig. 7 Runtimes (in logscale) for SCC decomposition for various GPU configurations Fi, j, where i is the number of search iterations per BFS kernel launch, and j is the interval used for input restructuring.In FBT0, 1-nh, the hybrid forward-backward kernel has been disabled and thereby claim them for exploration.Which states get to be claimed by which blocks is therefore dependent on the graph structure.In turn, which offsets and trans entries will be accessed by which blocks in the next search iteration is unknown beforehand, and thus we can assume that these accesses will not be coalesced.
An overall negative result has been obtained for wlan.Its graph has a structure which considerably limits the trimming procedure.It both has a low average out-degree and only a few states from which trimming can be instantiated.
In Fig. 8, results for MEC decomposition are presented.Speedups up to 79 times were measured.The cause for the increased speedups is that the additional steps after SCC decomposition can be performed extremely efficient in parallel on a GPU, since they require a single BFS-like analysis of the states and transitions, whereas such a BFS search on the CPU is much slower.
Finally, Fig. 9 presents our runtime results when applying FBT0,7-based MEC decomposition with the two proposed pivot selection optimisations suggested in Sect.5.3, and a MEC decomposition procedure using BFBT0,7 with warp-aggregation.As can be observed, the differences in runtime performance are rather small, but some meaningful conclusions can be drawn.First of all, warp-aggregation leads most of the time to a speedup.Given the time spent in the overall computation on pivot selection, the acquired speedup is actually quite remarkable; most of the runtime is spent performing the BFS searches and the trimming, but as noted, the atomic operations performed in pivot selection do present a bottleneck.Leveraging this bottleneck has a noticeable effect on the runtimes.Of course, this does not hold for the cases where trimming suffices to detect practically all the SCCs, such as in the cases 9 and 10.
The optimisation using the J sets, however, has in a number of cases a negative effect, and can most of the time not provide any additional speedups when used together with warp-aggregation, which is the setup we investigated.This seems to suggest that the loss of potential parallelism, which we refer to in Sect.2.4, in practice outweighs the potential benefit of having fewer states (and therefore threads) competing to become pivot.Also regarding BFBT, in a number of cases, the loss of potential parallelism (due to the number of identified regions not growing as quickly in BFBT as in FBT) seems to outweigh the ability to more efficiently perform FBT searches.This is interesting, since it demonstrates that not all optimisations of a given sequential algorithm necessarily are also improvements for a parallel version of it.

Conclusions
We presented GPU algorithms for finding SCCs and MECs in sparse graphs that are based on FBT and a bounding version of it.The implementations exhibit speedups of 15-30 times for SCC decomposition and up to 79 times for MEC decomposition.A critical improvement for SCC decomposition compared to related work is achieved by improving (coalesced) data access.Other causes are a new pivot selection procedure and the chosen data representation, while techniques such as local caching do not lead to performance improvement.Furthermore, we have explained a number of potential optimisations to further improve pivot selection.Both defining smaller candidate sets and using BFBT instead of FBT did not result in consistently faster runtimes.On the other hand, warp-aggregation does lead to relatively good improvements of the runtimes, certainly considering the amount of time that pivot selection takes as part of the overall decomposition procedure.
The extra steps required for MEC decomposition are very suitable for parallelisation on GPUs, which explains the large speedups overall.
For future work, we plan to address similar problems in probabilistic model checking [5], and to integrate the algorithms in model checking tools.Furthermore, besides BFBT, a number of other optimisations have been proposed for FB-based algorithms to find SCCs, such as the one using so-called spine-sets described by Gentilini et al. [21].It would be interesting to investigate whether any of those optimisations can effectively be used in a GPU implementation.

Example 4
Consider the MEC C = {s 3 } in the running example MDP.Z = Attr(C) \ C = {s 1 , s 3 } \ {s 3 } = {s 1 }, and indeed nodes s 0 and s 1 do not belong to any MEC of the MDP, as by Lemma 1.2.

Fig. 8
Fig. 8 Runtimes (in logscale) for MEC decomposition comparing the basic decomposition algorithm against the overall best GPU configuration

Fig. 9
Fig. 9 Runtimes (in logscale) for MEC decomposition comparing different pivot selection in FBT0,7, and a version of BFBT0,7 with warp-aggregation is strongly connected and every u ∈ U is e-closed for U .Strictly speaking, a set of nodes does not uniquely identify an end-component, but for each node one also has to keep track for which distribution μ the node is e-closed for the endcomponent at hand.As this does not play a role in the sequel, we refrain from these technical details.An end-component is a finite set of nodes such that for some choice of the distributions in all these nodes, the MDP will stay in these nodes with probability one.End-components thus play a similar role as terminal SCCs in digraphs.End-components that share common nodes can be merged into a single end-component.A maximal end-component (MEC) of G is an end-component C for which there is no end-component C = C such that C ⊂ C .Observe that every vertex in V belongs to at most one maximal end-component.Node s 1 in Fig.1is e-closed for X = {s 0 , s 1 , s 2 , s 3 } since E α (s 1 ) = {s 1 , s 2 , s 3 } ⊆ X ; evidently, node s 1 is not e-closed for X = {s 1 } since E α (s 1 ) X and E β (s 1 ) = {s 2 , s 3 } X .For node s 6 we have E α (s 6 ) = {s 5 , s 6 } and E β (s 6 ) = {s 4 }.Thus, E(s 6 ) = {s 4 , s 5 , s 6 }.

Table 1
Explanation of CUDA terminology