PRISM-PSY: Precise GPU-Accelerated Parameter Synthesis for Stochastic Systems
Abstract
In this paper we present PRISM-PSY, a novel tool that performs precise GPU-accelerated parameter synthesis for continuous-time Markov chains and time-bounded temporal logic specifications. We redesign, in terms of matrix-vector operations, the recently formulated algorithms for precise parameter synthesis in order to enable effective data-parallel processing, which results in significant acceleration on many-core architectures. High hardware utilisation, essential for performance and scalability, is achieved by state space and parameter space parallelisation: the former leverages a compact sparse-matrix representation, and the latter is based on an iterative decomposition of the parameter space. Our experiments on several biological and engineering case studies demonstrate an overall speedup of up to 31-fold on a single GPU compared to the sequential implementation.
1 Introduction
Model checking of continuous-time Markov chains (CTMCs) against continuous stochastic logic (CSL) formulae [1, 27] has numerous applications in many areas of science. In biochemistry, there is an interest in analysing hypotheses (formulated using CSL) about reaction networks that can be adequately modelled as CTMCs governed by the Chemical Master Equation [9, 22, 28]. In engineering disciplines, CTMCs are used to study various reliability and performance aspects of computer networks [5], communication [19] and security protocols [29].
Traditionally, stochastic model checking techniques assume that model parameters – namely, the transition rate constants – are known a priori. This is often not the case and one has to consider ranges of parameter values instead, for example, when the parameters result from imprecise measurements, or when designers are interested in finding parameter values such that the model fulfils a given specification. Such problems can be effectively formulated in the framework of parameter synthesis for CTMCs [10, 12, 24]: given a time-bounded CSL formula and a model whose transition rates are functions of the parameters, find parameter values such that the satisfaction probability of the formula meets a given threshold, is maximised, or minimised. In [10, 12] we developed synthesis algorithms that yield answers that are precise up to within an arbitrarily small tolerance value. The algorithms combine the computation of probability bounds with the refinement and sampling of the parameter space.
The complexity of the synthesis algorithms depends mainly on the size of the underlying model and on the number of parameter regions to analyse in order to achieve the desired precision. However, existing techniques do not scale with the model size and the dimensionality of the parameter space. For instance, as reported in [12], the synthesis of two parameters for a model with 5.1 K states requires the analysis of 5 K parameter regions and takes around 3.6 h.
In the last years, many-core graphical processing units (GPUs) have been utilised as general purpose, high-performance processing resources in computationally-intensive scientific applications. In light of this development, we redesign the synthesis algorithms using matrix-vector operations so as to ensure effective data-parallel processing and acceleration of the synthesis procedures on many-core architectures. The novelty of our approach is a two-level parallelisation scheme that distributes the workload for the processing of the state space and the parameter space, in order to optimally utilise the computational power of the GPU. The state space parallelisation builds on a sparse-matrix encoding of the underlying parametric CTMC. The parameter space parallelisation exploits the fact that our synthesis algorithms require the analysis of a large number of parameter regions during the parameter space refinement.
In this paper we present our new publicly available tool PRISM-PSY^{1} that implements the data-parallel algorithms together with a number of optimisations of the sequential algorithms, and employs the front-end of the probabilistic model-checker PRISM [26]. We systematically evaluate the performance of PRISM-PSY and demonstrate the usefulness of our precise parameter synthesis methods on several case studies, including survivability analysis of the Google File System [2, 16]. Our experiments show that the data-parallel synthesis achieves on a single GPU up to a 31-fold speedup with respect to the optimised sequential implementation and that our algorithms provide good scalability with respect to the size of the model and the number of parameter regions to analyse. As a result, PRISM-PSY enables the application of precise parameter synthesis methods to more complex problems, i.e. larger models and higher-dimensional parameter spaces.
The main contributions of this paper can be summarised as follows: (1) improvement of the sequential algorithms of [10, 12], leading in some cases to more than 10-fold speedup; (2) formulation of a backward variant of the parametric transient analysis of [10] using matrix-vector operations, which enables data-parallel implementation; (3) combination of the state space and parameter space parallelisation in order to fully utilise the available computational power; (4) development of the PRISM-PSY tool that enables precise parameter synthesis on many-core architectures and (5) systematic experimental evaluation of the tool.
Related Work. The parameter synthesis problem for CTMCs and bounded reachability specifications was first introduced in [24], where the authors resort to the analysis of the polynomial function describing how the reachability probability depends on the parameter values. Due to the high degree of the polynomials (determined by the number of uniformisation steps), only an approximate solution is obtained through the discretisation of the parameter space.
The function describing how the satisfaction probability of a linear time-bounded formula depends on the parameter values can be approximated through statistical methods. A technique based on Gaussian Process regression is presented in [6] and implemented in the U-check tool [7]. In contrast to our approach, statistical methods cannot provide guaranteed precision, and thus are not suitable for safety-critical applications.
Parameter synthesis has also been studied for discrete-time Markovian models and unbounded temporal properties [15, 23]. The synthesis algorithms are based on constructing a rational function describing the satisfaction probability by performing state elimination. This approach is implemented in the tool PROPhESY [18] that supports incremental parameter synthesis using SMT techniques, but is not suitable for time-bounded specifications and CTMCs.
Our tool builds on methods for the efficient GPU parallelisation of matrix-vector multiplication [4] and probabilistic model checking [8, 33]. In our previous work [3], we showed how the algorithms for LTL model checking can be redesigned in order to accelerate verification on GPUs.
2 Precise Parameter Synthesis
In this section we summarise the parameter synthesis problem for CTMCs and time-bounded CSL properties originally introduced in [12]. We also describe the sequential synthesis algorithms of [10, 12] and the improvements implemented in the PRISM-PSY tool, which provide the foundation for the new data-parallel algorithms (Sect. 3) and the baseline for evaluating the parallelisation speedup.
2.1 Problem Formulation
Parametric continuous-time Markov chains (pCTMCs) [24] extend the notion of CTMCs by allowing transition rates to depend on parameters. We consider pCTMCs with a finite set of states S and a finite set K of parameters ranging over closed real intervals, i.e., \([k^{\bot },k^{\top }] \subseteq \mathbb {R}\) for \(k \in K\). These induce a rectangular parameter space Open image in new window. Subsets of \(\mathcal {P}\) are referred to as parameter regions or subspaces. Given a pCTMC and a parameter space \(\mathcal {P}\), we denote with \(\mathcal {C}_{\mathcal {P}}\) the set \( \{ \mathcal {C}_p \ | \ p \in \mathcal {P} \}\), where \(\mathcal {C}_p\) is the instantiated CTMC obtained by replacing the parameters in the parametric rate matrix \(\mathbf {R}\) with the valuation p.
In the current implementation of the tool, we support only linear rate functions of the following two forms: for any \(s,s' \in S\), \(\mathbf {R}(s, s') = \sum _{k \in K}k \cdot a_{k,s,s'}\) (parametric rate) or \(\mathbf {R}(s, s') = b_{s,s'}\) (constant rate) where \({a_{k,s,s'}, b_{s,s'} \in \mathbb {R}_{\ge 0}}\). Such rate functions are sufficient to describe a wide range of systems, from biological to computer systems, as we will show in Sect. 4.
To specify properties over pCTMCs, we employ the time-bounded fragment of Continuous Stochastic Logic (CSL) [1] extended with time-bounded reward operators [27]. The current version of the tool considers only unnested formulae given by the following syntax: \(\varPhi \,{:}{:}= P_{\sim r} [\phi ] \mid R_{\sim r}[C^{\le t}]\) is a state formula, \({\phi \,{:}{:}= \varPsi \ U^{I} \ \varPsi }\) is a path formula, where \(\varPsi \,{:}{:}= \mathsf {true} \mid a \mid \lnot \varPsi \mid \varPsi \wedge \varPsi \), a is an atomic proposition evaluated over states, \(\sim \ \in \left\{ <, \le , \ge , > \right\} \), r is a probability (\(r \in [0,1]\)) or reward (\(r \in \mathbb {R}_{\ge 0}\)) threshold, \(t\in \mathbb {R}_{\ge 0}\) is a time bound, and I is a time interval of \(\mathbb {R}_{\ge 0}\). The future operator, \(F^{I}\), can be derived as \(F^{I}\ \varPsi \equiv \mathsf {true} \ U^{I}\ \varPsi \). Let \(\vDash \) denote a satisfaction relation. Intuitively, a state \(s \vDash \mathsf {P}_{\sim r}[\phi ]\) iff the probability of the set of paths starting in s and satisfying \(\phi \) meets \(\sim r\). A path \(\omega = s_0t_0s_1t_1\ldots \) satisfies \(\varPhi \mathsf {U}^{I} \varPsi \) iff there exists a time \(t \in I. ( \omega @ t \vDash \varPsi \wedge \forall t' \in [0,t). \omega @ t' \vDash \varPhi )\), where \(\omega @t\) denotes the state in \(\omega \) at time t. A state \(s \vDash \mathsf {R}_{\sim p}[\mathsf {C}^{\le t}]\) iff the expected rewards over the path starting in s cumulated until t time units satisfies \(\sim p\). We remark that the synthesis algorithms can be adapted to support the full fragment of time-bounded CSL including nested formulae, as shown in [10].
The satisfaction function captures how the satisfaction probability of a given property relates to the parameters and initial state. Let \( \phi \) be a CSL path formula, \(\mathcal {C}_{\mathcal {P}}\) be a pCTMC over a space \(\mathcal {P}\) and \(s \in S\). We denote with \(\varLambda _{\phi }: \mathcal {P}\! \rightarrow \! S \!\rightarrow \! [0,1]\) the satisfaction function such that \(\varLambda _{\phi }(p)(s)\) is the probability of the set of paths (from state s) satisfying \(\phi \) in \(\mathcal {C}_{p}\). The satisfaction function for reward formulae can be defined analogously and is omitted to simplify the presentation.
We consider two parameter synthesis problems: the threshold synthesis problem that, given a threshold \(\sim r\) and a CSL path formula \(\phi \), asks for the parameter region where the probability of \(\phi \) meets \(\sim r\); and the max synthesis problem that determines the parameter region where the probability of the input formula attains its maximum, together with probability bounds approximating that maximum. Solutions to the threshold synthesis problem admit parameter points left undecided, while, in the max synthesis problem, the actual set of maximising parameters is contained in the synthesised region. The min synthesis problem is defined and solved in a symmetric way to the max case.
For \(\mathcal {C}_{\mathcal {P}}\), \(\phi \), an initial state \(s_0\), a threshold \(\sim r\) and a volume tolerance \(\varepsilon > 0\), the threshold synthesis problem is finding a partition \(\{\mathcal {T},\mathcal {U},\mathcal {F}\}\) of \(\mathcal {P}\), such that: \(\forall p \in \mathcal {T}: \varLambda _\phi (p)(s_0) \sim r\); \(\forall p \in \mathcal {F}: \varLambda _\phi (p)(s_0) \not \sim r\); and \(\text{ vol }(\mathcal {U})/\text{ vol }(\mathcal {\mathcal {P}}) \le \varepsilon \), where \(\mathcal {U}\) is an undecided subspace and \(\text{ vol }(A)=\int _{_{A}} {1} d\mu \) is the volume of A.
For \(\mathcal {C}_{\mathcal {P}}\), \(\phi \), \(s_0\), and a probability tolerance \(\epsilon > 0\), the max synthesis problem is finding a partition \(\{\mathcal {T}, \mathcal {F}\}\) of \(\mathcal {P}\) and probability bounds \(\varLambda _\phi ^{\bot }\), \(\varLambda _\phi ^{\top }\) such that: \(\forall p \in \mathcal {T}: \varLambda _\phi ^{\bot } \le \varLambda _\phi (p)(s_0) \le \varLambda _\phi ^{\top }\); \(\exists p \in \mathcal {T}: \forall p'\in \mathcal {F}: \varLambda _\phi (p)(s_0) > \varLambda _\phi (p')(s_0)\); and \({\varLambda _\phi ^{\top } - \varLambda _\phi ^{\bot } \le \epsilon }\).
2.2 Solution of the Synthesis Problems
The key ingredient for solving the aforementioned synthesis problems is a procedure that takes a pCTMC \(\mathcal {C}_{\mathcal {P}}\) and CSL path formula \(\phi \), and provides safe under- and over-approximations of the minimal and maximal probability that \(\mathcal {C}_{\mathcal {P}}\) satisfies \(\phi \): for all \(s\in S\), it computes bounds \(\varLambda _{\min }(s)\) and \(\varLambda _{\max }(s)\) such that \(\varLambda _{\min }(s) \le \inf _{p \in \mathcal {P}} \varLambda _{\phi }(p)(s) \text { and } \varLambda _{\max }(s) \ge \sup _{p \in \mathcal {P}} \varLambda _{\phi }(p)(s)\). The procedure builds on a parametric transient analysis that computes safe bounds for the parametric transient probabilities in the discrete-time process derived from the CTMC. This discretisation is obtained through standard uniformisation and the Fox and Glynn algorithm [21] that is used to derive the required number of discrete steps to consider (also called uniformisation steps or iterations) for a given time bound^{2}. See [10, 27] for more details.
We now summarise the algorithms for threshold and max synthesis based on the partitioning and iterative refinement of the parameter space [12]. Assume a threshold synthesis problem for a path formula \(\phi \) with threshold \(\ge r\). At each step, the algorithm refines the undecided parameter subspace \(\mathcal {U}\), starting from \(\mathcal {U}=\mathcal {P}\): it generates a partition \(\mathcal {D}\) of \(\mathcal {U}\) and, for each \(\mathcal {R}\in \mathcal {D}\), computes the safe probability bounds \(\varLambda _{\min }^{\mathcal {R}}\) and \(\varLambda _{\max }^{\mathcal {R}}\) of the corresponding pCTMC \(\mathcal {C}_{\mathcal {R}}\). If \(\varLambda _{\min }^{\mathcal {R}} \ge r\), then the satisfaction of the threshold is guaranteed for the whole region \(\mathcal {R}\), which is hence added to \(\mathcal {T}\). Otherwise, the algorithm tests whether \(\mathcal {R}\) can be added to \(\mathcal {F}\) by checking if \(\varLambda _{\max }^{\mathcal {R}} < r\). If \(\mathcal {R}\) is neither in \(\mathcal {T}\) nor in \(\mathcal {F}\), it forms an undecided subspace that is added to the set \(\mathcal {U}\). If the volume tolerance \(\varepsilon \) is not met, the algorithm proceeds to the next iteration, where \(\mathcal {U}\) is further refined. The refinement procedure guarantees termination since the over-approximation \([\varLambda _{\min }^{\mathcal {R}}, \varLambda _{\max }^{\mathcal {R}}]\) can be made arbitrarily precise by reducing the volume of \(\mathcal {R}\) [13].
In the max synthesis case, the algorithm starts from \(\mathcal {T}=\mathcal {P}\) and iteratively refines \(\mathcal {T}\) until the tolerance \(\epsilon \) is met. Let \(\mathcal {D}\) be the partition of \(\mathcal {T}\) at a generic step. The algorithm rules out from \(\mathcal {T}\) subspaces that are guaranteed to be in \(\mathcal {F}\), by deriving an under-approximation M of the maximum satisfaction probability. Indeed, for \(\mathcal {R}\in \mathcal {D}\), \(\varLambda _{\max }^{\mathcal {R}} < M\) implies that \(\mathcal {R}\) is in \(\mathcal {F}\). M is derived by sampling a set of parameter values from the region \(\mathcal {R}\) with the highest \(\varLambda _{\min }^{\mathcal {R}}\) and taking the highest value of the satisfaction function over these values.
Improvements on the Sequential Algorithms. The PRISM-PSY tool introduces several improvements on the prototype implementations used in [10, 12]. Here we present those having the most significant impact on performance.
(1) Backward computation of probabilistic bounds. In [10, 12], the probability bounds \(\varLambda _{\min }\) and \(\varLambda _{\max }\) are computed using a forward variant of the parametric transient analysis, which requires a separate computation of the bounds for each initial state. In our tool, we also implemented a more efficient solution that requires only a single computation for all states, based on backward computation.
(2) Adaptive Fox-Glynn. While in previous implementations the number of uniformisation steps was fixed and obtained using the maximum exit rate (sum of outgoing rates per a state) of the whole parameter space, the adaptive Fox-Glynn technique computes the number of steps for each subregion separately, using the maximum exit rate of the inspected subregion. For large parameter spaces, this technique can significantly decrease the overall number of uniformisation steps, improving the performance by more than a factor of two.
(3) Refinement Strategies. The tool employs improved refinement algorithms that can decrease the total number of subregions to analyse. Specifically, for threshold synthesis, at each step only the undecided subregions with the largest volume are refined while, for max synthesis, only the regions with either the lowest lower probability bound (\(\varLambda _\phi ^{\bot }\)) or the highest upper bound (\(\varLambda _\phi ^{\top }\)).
3 Data-Parallel Algorithms for Parameter Synthesis
In this section we first introduce the basic concepts of the target hardware architecture, i.e. modern general-purpose GPUs. We then formulate the backward variant of the parametric transient analysis using matrix-vector operations, and describe the sparse-matrix representation of pCTMCs. Finally, we present a two-level parallelisation of the synthesis algorithms. A detailed description of the data-parallel algorithms for parameter synthesis can be found in [30].
3.1 Computational Model for Modern GPUs
Typical GPUs consist of multiple Streaming Multiprocessors (SMs), with each SM following a single instruction multiple threads (SIMT) model. This approach establishes a hierarchy of threads prior to the actual computation. Within this hierarchy, threads are arranged into blocks that are assigned for parallel execution on SMs. Threads are hardwired into groups of 32 called warps, which form a basic scheduling unit and execute instructions in a lock-step manner. If a sufficient number of threads is dispatched, each SM maintains a set of active warps to hide the memory access latency and maximise utilisation of its functional units.
The SIMT approach supports code divergence within threads of the warp, but this usually causes significant performance degradation due to the serialisation of the execution. Another characteristic of GPUs that significantly affects their performance is the way in which simultaneous memory requests from multiple threads in a warp are handled. Requests exhibiting spatial locality are maximally coalesced. Simply stated, accesses to consecutive addresses are served by a single memory fetch as long as they are in the same memory segment.
A typical GPU program consists of a host code running on the CPU and a device code running on the GPU. The device code is structured into kernels that execute the same scalar sequential program in many independent data-parallel threads. The combination of out-of-order CPU and data-parallel processing GPU allows for heterogeneous computation.
3.2 Backward Computation of Probability Bounds
For a pCTMC \(\mathcal {C}_{\mathcal {R}}\) over a region Open image in new window and a target set \(A \subseteq S\), the parametric backward analysis computes a series of vectors \(\sigma _{i}^{\min }\) and \(\sigma _{i}^{\max }\) such that, for all \(s\in S\), \({\sigma _{i}^{\min }(s) \le \inf _{p \in \mathcal {P}} \sigma _{i,p}(s)}\) and \(\sigma _{i}^{\max }(s) \ge \sup _{p \in \mathcal {P}} \sigma _{i,p}(s)\), where \(\sigma _{i,p}(s)\) is the probability that, starting from the state s, a state in A is reached after i discrete steps in \(\mathcal {C}_p\). From these vectors, the probability bounds \(\varLambda _{\min }(s)\) and \(\varLambda _{\max }(s)\) are computed in a similar way to non-parametric CTMCs [27].
where \(k^{\star }\) = \(k^{\top }\) if \(\sigma _{i}^{\max }(s') > \sigma _{i}^{\max }(s)\) and \(k^\bot \) otherwise. These equations allow us to compute the vector \(\sigma _{i+1}^{\max }\) using matrix-vector operations, as shown in the implementation of \({\diamond }\) in Algorithm 1. Note that Eq. 2 is used when the transition from s to \(s'\) has a parametric rate, while Eq. 3 is used when it has a constant rate.
An approximation error is introduced because \(\sigma _{i+1}^{\max }\) is computed by optimising \(\sigma _{i+1,p}\) locally, i.e. at each step and at each state, and the error accumulates at each uniformisation step. We examine this error and its convergence in [13]. The forward variant of the parametric transient analysis can also be formulated using a vector-matrix operator [10], but the resulting code has more complex control flow and higher branch divergence, which makes parallelisation less efficient.
3.3 Sparse-Matrix Representation of Parametric CTMCs
We introduce a sparse-matrix representation of parametric CTMCs that allows us to implement the operator \({\diamond }\) in such a way that the resulting program has a similar control flow and memory access pattern as the standard matrix-vector multiplication, for which efficient data-parallel implementations exist [4, 8, 33].
We represent the data in a compact format based on the compressed sparse row (CSR) matrix format. The CSR format stores only the non-zero values of the rate matrix \(\mathbf {R}\) using three arrays: non-zero values, their column indices, and row beginnings. The CSR format is also used in the PRISM tool as the fastest explicit representation for CTMCs [26].
To handle the non-parametric transitions separately in a more efficient way, we decompose \(\mathbf {R}\) into the non-parametric matrix, stored in the CSR format, and the parametric matrix. To enable an efficient data-parallel implementation of the operator \({\diamond }\), for a region Open image in new window and for each parametric transition rate \(\mathbf {R}(s,s')\) two quantities, \(r_{s,s'}^{\bot } = \sum _{k \in K}k^{\bot } \cdot a_{k,s,s'}\) and \(r_{s,s'}^{\top } = \sum _{k \in K}k^{\top } \cdot a_{k,s,s'}\), are stored. From Eq. 2, it is enough to test \({\sigma _{i}^{\max }(s') - \sigma _{i}^{\max }(s) > 0}\) to decide whether to use \(r_{s,s'}^{\top }\) or \(r_{s,s'}^{\bot }\) in the multiplication, as illustrated in Algorithm 1.
In the parallel version, we provide an additional implementation using data structures based on the ELLPACK (ELL) sparse matrix representation [4]. The advantage of ELL over CSR is that it provides a single-stride aligned access to the data arrays, meaning that memory accesses within a single warp are reasonably coalesced. ELL yields better performance than CSR for some problems.
3.4 GPU Parallelisation
We implemented PRISM-PSY in Open Computing Language (OpenCL) [32]. In contrast to other programming frameworks, OpenCL supports multiple platforms and GPUs, and thus provides better portability. Moreover, its performance is comparable with that of specialised frameworks (e.g. CUDA [20]).
The synthesis algorithms are executed in a heterogeneous way. The sequential refinement procedure is executed on the CPU. For each parameter region \(\mathcal {R}\) to analyse, the CPU prepares a kernel that computes the probability bounds \(\varLambda _{\min }^{\mathcal {R}}\) and \(\varLambda _{\max }^{\mathcal {R}}\) on the GPU, based on the backward parametric transient analysis described above. Following [4, 8, 33], we implement a state space parallelisation, i.e. a single row of the rate matrix (corresponding to the processing of a single state) is mapped to a single computational element. Note that the models we consider typically have a balanced distribution of the state successors, and thus yield a balanced distribution of non-zero elements in the rows of the matrix. This ensures a good load balancing within the warps and blocks.
In the case of models with small numbers of states, this parallelisation is not able to efficiently utilise all computational elements, since some of them will be idle during the kernel execution. To overcome this potential performance degradation, we combine state space parallelisation with parameter space parallelisation that computes the probability bounds for multiple parameter regions in parallel. As demonstrated in the experimental evaluation (Sect. 4), this two-level parallelisation significantly improves performance on small models. In many cases, this solution can improve the runtime of large models too, because it allows the thread scheduler to better hide memory latency.
Since parallel kernel execution is unsupported by many GPU devices or it may fundamentally decrease performance, we provide a way to perform, in a single compute kernel, multiple matrix-vector operations over multiple parameter regions. The solution exploits the fact that, in our case, the rate matrices for different regions have the same structure and only differ in the values of \(r_{s,s'}^{\top }\) and \(r_{s,s'}^{\bot }\). We extend the sparse-matrix representation of the pCTMC and store the values \(r_{s,s'}^{\bot }\) and \(r_{s,s'}^{\top }\) as well as \(\sigma _{i}^{\max }(s)\) and \(\sigma _{i+1}^{\max }(s)\) for all regions. This allows us to utilise \(m\cdot |S|\) computational elements for m parallel regions. Algorithm 1 illustrates the kernel for the two-level parallelisation using the CSR format. The vectors \(\texttt {matRowBeg}\) and \(\texttt {matCol}\), the same for all regions, keep the column indices and row beginnings, respectively. The vectors \(\texttt {matValTop}\) and \(\texttt {matValBot}\) keep the non-zero values of \(r_{s,s'}^{\top }\) and \(r_{s,s'}^{\bot }\), respectively. We store only the current \(\sigma _{i}^{\max }\) and \(\sigma _{i+1}^{\max }\) using two vectors and the vectors are swapped between the iterations.
4 Experimental Evaluation
In this section we evaluate the performance of the data-parallel synthesis algorithms on case studies of biological and computer systems. We discuss how model features affect parallelisation and show that parameter synthesis can be meaningfully employed to analyse various requirements, ranging from quality of service to the reliability of synthetic biochemical networks.
All the experiments were run on a 4-core Linux workstation with an AMD Phenom™ II X4 940 Processor @ 3 GHz, 8 GB DDR2 @ 1066 MHz RAM and an NVIDIA GeForce GTX 480 GPU with 1.5 GB of GPU memory. The GPU has 14 SMs, each having 32 cores and the capability to maintain up to 48 active warps. Therefore, the GPU can simultaneously maintain and schedule up to 21504 active threads to maximise the utilisation of its computational elements.
In the following, Java denotes the optimised sequential implementation and \(\textsc {Csr}_n\) (\(\textsc {Ell}_n\)) the data-parallel implementation based on CSR (ELL) with n subregions being processed in parallel. We report only an approximate value for the number of final subregions, since it differs slightly in some experiments due to parallel processing. We also report the results for the parameter space parallelisation only up to the best performance is reached.
4.1 Google File System
Figure 2 illustrates the GSPN model of the GFS. Default values for stochastic rates can be found in [2, 16]. Files are divided into chunks of equal size. Each chunk exists in several copies, located in different chunk servers. There is one master server that is responsible for keeping the locations of the chunk copies, monitoring the chunk servers and replicating the chunks. The master can be: up and running (token at \(\mathsf {M\_up}\)); or failed (\(\mathsf {M1}\)), due to a software (\(\mathsf {M\_soft\_d}\)) or hardware (\(\mathsf {M\_hard\_d}\)) problem. The model reproduces the life-cycle of a single chunk: the numbers of available and lost copies are given by places \(\mathsf {R\_present}\) and \(\mathsf {R\_lost}\), respectively. Lost chunks are replicated through transition \(\mathsf {replicate}\). R is the maximum number of copies. We consider M chunk servers whose behaviour is analogous to that of the master. When a chunk server fails, a chunk copy is lost (\(\mathsf {destroy}\)) or not (\(\mathsf {keep}\)) depending on whether the server is storing the single chunk under consideration. We set \(M=60\) and \(R=3\), yielding a model with 21.6 K states and 145 K transitions.
Performance of the GFS model: 21.6K states, 145K transitions, and \(\le \)47K iterations per subregion. Details of the synthesis problems are reported in Fig. 3.
Threshold synthesis \(\phi _{\mathsf {GFS_1}}\) | Max synthesis f\(\phi _{\mathsf {GFS_2}}\) | Threshold synthesis \(\phi _{\mathsf {GFS_2}}\) | ||||||
---|---|---|---|---|---|---|---|---|
Impl. | Time (s) | Speedup | Impl. | Time (s) | Speedup | Impl. | Time (s) | Speedup |
Java | 842 | 1.0 | Java | 3279 | 1.0 | Java | 12221 | 1.0 |
\(\textsc {Csr}_1\) | 56 | 15.0 | \(\textsc {Csr}_1\) | 257 | 12.8 | \(\textsc {Csr}_1\) | 764 | 16.0 |
\(\textsc {Csr}_4\) | 51 | 16.5 | \(\textsc {Csr}_{4}\) | 239 | 13.7 | \(\textsc {Csr}_{8}\) | 660 | 18.5 |
\(\textsc {Csr}_{16}\) | 51 | 16.5 | \(\textsc {Csr}_{8}\) | 211 | 15.5 | \(\textsc {Csr}_{64}\) | 636 | 19.2 |
\(\textsc {Ell}_{16}\) | 41 | 20.5 | \(\textsc {Ell}_{32}\) | 207 | 15.8 | \(\textsc {Ell}_{16}\) | 505 | 24.2 |
We now evaluate a property requiring that \(\mathsf {SL_3}\) is reached strictly within the time interval [40, 60]: \(\phi _{\mathsf {GFS_2}} = \lnot \mathsf {SL_3}\ U^{[40,60]} \ \mathsf {SL_3}\). Although it is generally sought to reach the required QoS as soon as possible, this property can be used in scenarios like planned downtime, where the service does not need to be up before the time scheduled for maintenance. In Fig. 3(b), we report the results of max synthesis for parameter \(\mathsf {c\_hard\_re}\). The maximising parameters (indicated with a black arrow) are found in the region approximately given by \(\mathsf {c\_hard\_re} \in [1.2, 1.23]\), since for high repair rates \(\mathsf {SL_3}\) is reached too early.
In the last experiment, we introduce one additional parameter, \(\mathsf {c\_fail}\), i.e. the rate at which any failure (hardware or software) occurs in a chunk server. Since the GFS is designed to run on cheap commodity hardware, this rate can be controlled indirectly through the reliability of the machines used. We consider a threshold synthesis problem with property \(\phi _{\mathsf {GFS_2}}\) and threshold \(\ge 0.5\). Results in Fig. 3(c) evidence that, interestingly, the satisfaction probability is almost independent from the failure rate, except when \(\mathsf {c\_fail}\) approaches 1, and thus slightly higher repair rates are needed.
Table 1 reports the performance of the tool on the above experiments, namely, the speedup achieved by the data-parallel algorithms. Although the state space parallelisation utilises the GPU sufficiently (enough threads are dispatched), the parameter space parallelisation further improves performance, providing up to 24-fold and 16-fold speedup with respect to the sequential algorithm for threshold and max synthesis, respectively. The efficiency of the parameter space parallelisation depends on the effective usage of GPU resources, and thus the speedup does not scale with respect to the number of regions processed in parallel. In this case the adaptive Fox-Glynn technique does not bring any benefit, since the parameters we analyse do not affect the maximal exit rate.
4.2 Epidemic Model
Max synthesis for the epidemic model: parameter space \(\mathcal {P}_{\mathsf {SIR}}\) and probability tolerance\(\epsilon = 1\,\%\). \(\phi _{\mathsf {SIR}}(100,120)\): 5.1K states, 10K transitions and \(\le \) 3.1K iterations per regionand \(\sim \)3.1K final regions. \(\phi _{\mathsf {SIR}}(100,200)\): 20K states, 40K transitions and \(\le \)12K iterations per region and \(\sim \)826 final regions (depicted on the right).
Table 2 (left) lists the results obtained with PRISM-PSY on the same synthesis problem. We can see that the optimised sequential implementation is about 14-fold faster (918 sec. vs 3.6 h.). This significant acceleration is explained by: more sophisticated refinement strategy for max/min synthesis, which reduces the number of final regions to 3K (\(\sim \)2-fold speedup); the adaptive Fox-Glynn technique, which reduces the number of iterations (\({\sim 2.5\text{-fold }}\) speedup); and more efficient data structures that accelerate the computation of the probability bounds as well as the refinement procedure (\(\sim \)3-fold speedup). Note that the actual benefits of these enhancements essentially depend on the structure of the model and the synthesis problem. As the epidemic model is relatively small, the state space parallelisation is not able to sufficiently utilise the GPU, and thus the \(\textsc {Csr}_1\) implementation provides only a 2.5-fold speedup. The parameter space acceleration further improves the speedup to 5.6 (\(\textsc {ELL}_{128}\) implementation).
We now consider a more complicated variant of the problem, where we double the population size and extend the time horizon to \(t_2=200\). Results are presented in Table 2 (middle). In this case, the state space parallelisation sufficiently utilises the GPU and for \(\textsc {CSR}_{1}\), we obtain a 10.3-fold speedup. On the other hand, the parameter space parallelisation reduces the benefits of the adaptive Fox-Glynn technique, and thus overall performance is improved only slightly. Table 2 (right) depicts the results of max synthesis for the larger variant.
4.3 Signalling in Prokaryotic Cells
Threshold synthesis for the signalling model: property \(\varPhi _{\mathsf {SIG}}\), parameter space \(\mathcal {P}_{\mathsf {SIG}}\) and \(\varepsilon = 9\,\%\). The variant from [14]: 116K states, 954K transitions, \(\le 19K\) iterations per region and \(\sim \)70 final regions (depicted on the right). The larger variant: 424K states, 3.6M transitions, \(\le \) 24K iterations per region, and \(\sim \)67 final regions.
As shown in Table 3, a speedup up to 24.7 is obtained using \(\textsc {ELL}_{4}\), which further improves the GPU utilisation and the memory access pattern of the pure state space parallelisation. We also consider a larger variant of the model (about 3.6-times), for which we obtain an even better speedup (up to \({31.1\text{-fold }}\)), so demonstrating good scalability of the data-parallel algorithm. Table 3 (right) depicts the synthesis results for the small variant, evidencing the non-monotonicity of the satisfaction function for the reward property.
4.4 Approximate Majority
Threshold synthesis for the approximate majority: property \(\varPhi _{\mathsf {AM}} \), parameter space \(\mathcal {P}_{\mathsf {AM}}\) and \({\varepsilon = 10\,\%}\). 120 states, 273 transitions, \(\le \) 700K iterations per region and \({\sim }911\) final regions.
Runtime (s) | Speedup | |
---|---|---|
Java | 1097 | 1.0 |
\(\textsc {Csr}_1\) | 11375 | 0.1 |
\(\textsc {Csr}_4\) | 3057 | 0.4 |
\(\textsc {Csr}_{16}\) | 951 | 1.2 |
\(\textsc {Csr}_{64}\) | 319 | 3.4 |
\(\textsc {Csr}_{128}\) | 195 | 5.6 |
\(\textsc {Ell}_{128}\) | 193 | 5.7 |
Due to the low number of states, the state space parallelisation utilises only a small portion of the computational elements of the GPU. Therefore, the GPU parallelisation using a small number of parallel parameter regions slows down the computation, as shown in Table 4. Increasing the number of parallel regions (up to 128) improves the GPU utilisation, and hence performance, yielding up to 5.7-fold speedup.
This experiment also demonstrates that, in contrast to the Monte Carlo-based optimisation, precise parameter synthesis provides detailed information about the impact of parameters on the probability of correct decision, as shown in Fig. 4. Note that the backward transient analysis implemented in our tool computes the probability bounds for all the reachable states, in this case all the inputs satisfying \({A+B=14}\).
4.5 Workstation Cluster
Finally, we consider a model describing a cluster of workstations consisting of two sub-clusters with N workstations in each, connected in a star topology [25, 34]. Both sub-clusters have their own switch that connects the workstations in the sub-cluster with a central backbone. The cluster maintains the minimum quality of service if at least 75 % of the workstations are operational and connected. We assume that one can control the workstation inspection (\(ws\_check\)), repair (\(ws\_repair\)) and failure (\(ws\_fail\)) rates.
Threshold synthesis for the cluster model: property \(\varPhi _{\mathsf {CLU}}\), parameter space \(\mathcal {P}_{\mathsf {CLU}} \) and \(\varepsilon = 10\,\%\). 86K states, 415K transitions, \(\le \)9K iterations per region and \({\sim }273\) final regions.
Runtime (s) | Speedup | |
---|---|---|
Java | 12074 | 1.0 |
\(\textsc {Csr}_1\) | 637 | 19.0 |
\(\textsc {Csr}_{4}\) | 674 | 17.9 |
\(\textsc {Ell}_{4}\) | 672 | 18.0 |
We synthesise the parameters such that the minimum quality of service is not maintained at most 0.1 % of the time. This is formalised as \(\varPhi _{\mathsf {CLU}} = R_{\le 0.1 \cdot t} [C\le t]\), and associating a reward of 1 to states where the minimum quality of service is not provided. In this experiment, we use \(t=100\) and parameter space \({\mathcal {P}_{\mathsf {CLU}} = ws\_check \times ws\_repair \times ws\_fail \in [5,20]\times [0.5,5] \times [0.001,0.02]}\).
Table 5 presents the results for \({N = 48}\) (85K states and 415 K transitions). In this model, the parameter space parallelisation considerably reduces the benefits of the adaptive Fox-Glynn technique, and thus the \(\textsc {CSR}_{1}\) implementation provides the best performance, leading to a 19-fold speedup.
4.6 Result Analysis
State space parallelisation improves the scalability of the computation with respect to the model size. The experiments demonstrate that the speedup compared to the sequential baseline tends to improve with the number of states (see Tables 2 and 3). On the other hand, for smaller models, an insufficient number of threads is dispatched, leading to performance degradation (see Table 4). In most cases, the ELL format moderately outperforms the CSR format and it also works better with the parameter space parallelisation due to the more coalesced memory access pattern. Since the refinement procedure (running solely on CPU) is more complicated for max/min synthesis, for these instances the overall speedup is lower than that for threshold synthesis.
Parameter space parallelisation allows us to efficiently utilise the GPU even for small models. It scales well up to reaching the maximal number of active threads that can be dispatched (see Table 4). In practice, performance usually increases even beyond this point, since the parameter space parallelisation can improve the memory access locality. On the other hand, it mitigates the advantage of the adaptive Fox-Glynn technique, which can lead to performance degradation, as reported in Table 5. Importantly, PRISM-PSY can also be configured to perform this parallelisation using multi-core CPUs (not discussed here).
Our experiments clearly indicate that the tool is able to provide good scalability with respect to the number of computational elements. Since the two-level parallelisation can tune the GPU utilisation for various synthesis problems, we expect that the execution of the tool on new generations of GPUs with a larger number of cores will lead to a further improvement in acceleration.
5 Conclusion
We have introduced the tool PRISM-PSY that performs precise parameter synthesis for CTMCs and time-bounded specifications. In order to overcome the high computational demands, we have developed data-parallel versions of the algorithms allowing us to significantly accelerate synthesis on many-core GPUs. As a result, the tool provides up to 31-fold speedup with respect to the optimised sequential implementation, and thus considerably extends the applicability of precise parameter synthesis. In future we will extend the tool to support the full fragment of time-bounded CSL and multi-affine rate functions [13].
Footnotes
References
- 1.Aziz, A., Sanwal, K., Singhal, V., Brayton, R.: Verifying continuous time Markov chains. In: Alur, R., Henzinger, T.A. (eds.) CAV 1996. LNCS, vol. 1102. Springer, Heidelberg (1996)CrossRefGoogle Scholar
- 2.Baier, C., Hahn, E.M., Haverkort, B.R., Hermanns, H., Katoen, J.-P.: Model checking for performability. Math. Struct. Comput. Sci. 23(04), 751–795 (2013)MathSciNetCrossRefMATHGoogle Scholar
- 3.Barnat, J., Bauch, P., Brim, L., Češka, M.: Designing fast LTL model checking algorithms for many-core GPUs. J. Parallel Distrib. Comput. 72(9), 1083–1097 (2012)CrossRefGoogle Scholar
- 4.Bell, N., Garland, M.: Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical report NVR-2008-004, NVIDIA Corporation (2008)Google Scholar
- 5.Bolch, G., Greiner, S., de Meer, H., Trivedi, K.S.: Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. Wiley, New Jersey (2006)CrossRefMATHGoogle Scholar
- 6.Bortolussi, L., Milios, D., Sanguinetti, G.: Smoothed model checking for uncertain continuous time markov chains. CoRR ArXiv, 1402.1450 (2014)Google Scholar
- 7.Bortolussi, L., Milios, D., Sanguinetti, G.: U-check: model checking and parameter synthesis under uncertainty. In: Campos, J., Haverkort, B.R. (eds.) QEST 2015. LNCS, vol. 9259, pp. 89–104. Springer, Heidelberg (2015)CrossRefGoogle Scholar
- 8.Bošnački, D., Edelkamp, S., Sulewski, D., Wijs, A.: Parallel probabilistic model checking on general purpose graphics processors. Int. J. Softw. Tools Technol. Transf. 13(1), 21–35 (2010)Google Scholar
- 9.Brim, L., Češka, M., Šafránek, D.: Model checking of biological systems. In: Bernardo, M., de Vink, E., Di Pierro, A., Wiklicky, H. (eds.) SFM 2013. LNCS, vol. 7938, pp. 63–112. Springer, Heidelberg (2013)CrossRefGoogle Scholar
- 10.Brim, L., Češka, M., Dražan, S., Šafránek, D.: Exploring parameter space of stochastic biochemical systems using quantitative model checking. In: Sharygina, N., Veith, H. (eds.) CAV 2013. LNCS, vol. 8044, pp. 107–123. Springer, Heidelberg (2013)CrossRefGoogle Scholar
- 11.Cardelli, L., Csikász-Nagy, A.: The cell cycle switch computes approximate majority. Scientific Reports, 2 (2012)Google Scholar
- 12.Češka, M., Dannenberg, F., Kwiatkowska, M., Paoletti, N.: Precise parameter synthesis for stochastic biochemical systems. In: Mendes, P., Dada, J.O., Smallbone, K. (eds.) CMSB 2014. LNCS, vol. 8859, pp. 86–98. Springer, Heidelberg (2014)Google Scholar
- 13.Češka, M., Dannenberg, F., Paoletti, N., Kwiatkowska, M., Brim, L.: Precise parameter synthesis for stochastic biochemical systems. Submitted to Acta Informatica (2015)Google Scholar
- 14.Česka, M., Šafránek, D., Dražan, S., Brim, L.: Robustness analysis of stochastic biochemical systems. PLoS ONE 9(4), e94553 (2014)CrossRefGoogle Scholar
- 15.Chen, T., Hahn, E.M., Han, T., Kwiatkowska, M., Hongyang, Q., Zhang, L.: Model repair for Markov decision processes. In: Theoretical Aspects of Software Engineering (TASE), pp. 85–92. IEEE (2013)Google Scholar
- 16.Cloth, L., Haverkort, B.R.: Model checking for survivability! In: Quantitative Evaluation of Systems (QEST), pp. 145–154. IEEE (2005)Google Scholar
- 17.Dalchau, N., Murphy, N., Petersen, R., Yordanov, B.: Synthesizing and tuning chemical reaction networks with specified behaviours. In: Phillips, A., Yin, P. (eds.) DNA 2015. LNCS, vol. 9211, pp. 16–33. Springer, Heidelberg (2015)CrossRefGoogle Scholar
- 18.Dehnert, C., Junges, S., Jansen, N., Corzilius, F., Volk, M., Bruintjes, H., Katoen, J.-P., Ábrahám, E.: PROPhESY: a probabilistic parameter synthesis tool. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 214–231. Springer, Heidelberg (2015)CrossRefGoogle Scholar
- 19.Duflot, M., Kwiatkowska, M., Norman, G., Parker, D.: A formal analysis of bluetooth device discovery. Int. J. Softw. Tools Technol. Transfer 8(6), 621–632 (2006)CrossRefGoogle Scholar
- 20.Fang, J., Varbanescu, A.L., Sips, H.: A comprehensive performance comparison of CUDA and OpenCL. In: International Conference on Parallel Processing (ICCP). IEEE (2011)Google Scholar
- 21.Fox, B.L., Glynn, P.W.: Computing poisson probabilities. Commun. ACM 31(4), 440–445 (1988)MathSciNetCrossRefGoogle Scholar
- 22.Daniel, T.: Gillespie.: exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81(25), 2340–2361 (1977)CrossRefGoogle Scholar
- 23.Hahn, E.M., Hermanns, H., Zhang, L.: Probabilistic reachability for parametric Markov models. Int. J. Softw. Tools Technol. Transfer (STTT) 13(1), 3–19 (2011)CrossRefGoogle Scholar
- 24.Han, T., Katoen, J.-P., Mereacre, A.: Approximate parameter synthesis for probabilistic time-bounded reachability. In: Real-Time Systems Symposium (RTSS), pp. 173–182. IEEE (2008)Google Scholar
- 25.Haverkort, B.R., Hermanns, H., Katoen, J.-P.: On the use of model checking techniques for dependability evaluation. In: Symposium on Reliable Distributed Systems (SRDS). IEEE (2000)Google Scholar
- 26.Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: verification of probabilistic real-time systems. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 585–591. Springer, Heidelberg (2011)CrossRefGoogle Scholar
- 27.Kwiatkowska, M., Norman, G., Parker, D.: Stochastic model checking. In: Bernardo, M., Hillston, J. (eds.) SFM 2007. LNCS, vol. 4486, pp. 220–270. Springer, Heidelberg (2007)CrossRefGoogle Scholar
- 28.Madsen, C., Myers, C.J., Roehner, N., Winstead, C., Zhang, Z.: Utilizing stochastic model checking to analyze genetic circuits. In: Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 379–386. IEEE (2012)Google Scholar
- 29.Norman, G., Shmatikov, V.: Analysis of probabilistic contract signing. J. Comput. Secur. 14(6), 561–589 (2006)Google Scholar
- 30.Pilař, P.: Accelerating parameter synthesis for stochastic models. Master’s thesis, Faculty of Informatics, Masaryk University, Czech Republic (2015)Google Scholar
- 31.Steuer, R., Waldherr, S., Sourjik, V., Kollmann, M.: Robust signal processing in living cells. PLoS Comput. Biol. 7(11), e1002218 (2011)MathSciNetCrossRefGoogle Scholar
- 32.Stone, J.E., Gohara, D., Shi, G.: OpenCL: a parallel programming standard for heterogeneous computing systems. Comput. Sci. Eng. 12(3), 66–73 (2010)CrossRefGoogle Scholar
- 33.Wijs, A.J., Bošnački, D.: Improving GPU sparse matrix-vector multiplication for probabilistic model checking. In: Donaldson, A., Parker, D. (eds.) SPIN 2012. LNCS, vol. 7385, pp. 98–116. Springer, Heidelberg (2012)CrossRefGoogle Scholar
- 34.PRISM - Case Studies - Workstation Cluster. http://www.prismmodelchecker.org/casestudies/cluster.php. Accessed September 2015