
1 Introduction

Parity games are one of the most useful and effective algorithmic tools used in automated formal verification [2, 5, 18]. Indeed, several computational problems, such as model checking and automated synthesis from temporal logic specifications, can be reduced to the solution of a parity game [2, 5]. More formally, a parity game is a two-player zero-sum game of infinite duration played on a finite graph. Since these games are determined [8, 14], solving them is equivalent to finding a winning strategy for one of the two players; or, equivalently, deciding from which vertices in the graph one of the two players can force a win no matter which strategy the other player uses. The main question regarding parity games is the computational complexity of finding a solution of the game, a problem known to be in \(\text{ NP }\cap \text{ co-NP }\) [11]. However, despite decades of research, a polynomial-time algorithm to solve such games remains elusive. The best-known decision procedures to solve parity games, most of them recently developed [4, 13], run in quasi-polynomial time, providing better worst-case complexity upper bounds than the earlier exponential-time approaches [18] found in the parity games literature.

The importance of parity games in the solution of real-life automated verification problems, and the lack of a polynomial-time decision procedure to solve such games, have motivated the development and implementation of algorithms that can solve parity games reasonably efficiently in practice, despite their known worst-case exponential time complexity. In the quest for such decision procedures, several approaches have been investigated over the last two decades, ranging over the choice of high-level algorithm used to reason about parity games, the programming language used to implement it, the concrete data structures used to represent the games, and the type of hardware architecture used for deployment [6, 7, 9, 17].

Progress in solving parity games in practice has been made in different directions. In [7], a state-of-the-art implementation of the best-known algorithms for solving parity games was presented. In that work, two algorithms were found to deliver the best performance in practice, namely, Zielonka’s recursive algorithm (ZRA [18]) and priority promotion [3], with the former showing slightly better performance when solving random games and a selection of structured games for model checking, and the latter outperforming ZRA when solving a selection of structured games for equivalence checking. Overall, however, the two algorithms exhibit extremely similar performance in practice, including against a parallel implementation of ZRA. Another attempt to improve the performance of parity game solving is presented in [6]. In that work, better performance is sought through a parallel implementation of ZRA, an algorithm known to consistently deliver the best performance on different platforms and for different types of games.

These two works [6, 7] reach two strikingly opposing conclusions. While in [7] the parallel implementation of ZRA is even outperformed by the best sequential implementation of the same algorithm, in [6] significant gains in performance are observed when parallelising the computation of ZRA – which may solve a large set of random parity games between 3.5 and 4 times faster than the sequential implementation of the same algorithm. These two results, both arguably representative of the state of the art in the practical solution of parity games, indicate that no definitive conclusion can be drawn about what the best approach to solving parity games in practice is, let alone whether a parallel implementation would necessarily produce better results than its sequential version. In this paper, we present a new approach to solving parity games, and investigate some of the issues exposed by the two papers above.

More specifically, motivated by the need to find effective new techniques for solving parity games, in particular in large practical settings, in this paper we:

  1.

    propose a novel matrix-based approach to solving parity games, based on ZRA [13, 18], arguably, the best-performing algorithm in practice [7];

  2.

    study the complexity of our matrix-based procedure, and show that it retains the optimal complexity bounds of the best algorithms for parity games [13];

  3.

    develop a parallel implementation, which takes advantage of methods and hardware for matrix manipulation using sophisticated GPU technologies;

  4.

    investigate a number of alternative implementations of our matrix-based approach in order to better assess its usefulness in practical settings.

Our matrix-based approach, whose parallel implementation outperforms state-of-the-art solvers for parity games, consists in reducing key operations on parity games to simple computations on large matrices, which can be significantly accelerated in practice using sophisticated techniques for matrix manipulation, specifically, modern GPU technologies. Firstly, our matrix-based approach partly builds on the observation that most of the computation time when using ZRA is spent in a particular subroutine, the “attractor” function, which we can parallelise. Secondly, we also rely on the observation that computations on matrices – which guide the search for the solution of parity games within our approach – can be efficiently parallelised using a combination of algorithmic techniques for parallel computation and GPU devices.

2 Preliminaries

A parity game is a two-player zero-sum infinite-duration game played over a finite directed graph \(G = (V_0, V_1, E, \varOmega )\), where \(V = V_0 \cup V_1\) is a set of vertices/nodes partitioned into vertices \(V_0\) controlled by Player Even/0 and vertices \(V_1\) controlled by Player Odd/1. Whenever a statement about both players is made, we may use the letter q \((\in \{0,1\})\) to refer to either player, and \(1-q\) to refer to the other player in the game. Without any loss of generality, we also assume that every vertex in the graph has at least one successor. Moreover, the function \(\varOmega : V \rightarrow \mathbb {N}\) is a labelling function which assigns a priority to each vertex of the graph. Intuitively, a parity game is played by moving a token along the graph (starting from some designated node in V), with the owner of the node the token is currently on selecting a successor node in the graph. Because every vertex has a successor, this process continues indefinitely, producing an infinite sequence of visited nodes, and consequently an infinite sequence of seen priorities. The winner of a particular play is determined by the highest priority that occurs infinitely often: Player 0 wins if the highest infinitely recurring priority is even, while Player 1 wins if it is odd. Parity games are determined, which means that it is always the case that one of the two players has a strategy (called a winning strategy) that wins against all possible strategies of the other player. Solving a parity game amounts to deciding, for every node in the game, which player has a winning strategy for the game starting in that node. That is, computing disjoint sets \(W_0\subseteq V\) and \(W_1\subseteq V\) such that Player q has a winning strategy to win every play in the game that starts from a node in \(W_q\), with \(q\in \{0,1\}\).
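To make the winning condition concrete, the following minimal Python sketch decides the winner of an ultimately periodic play, given as a finite prefix followed by a cycle repeated forever (plays arising from positional strategies have this shape). The node names and the `priority` map are hypothetical, chosen only for illustration:

```python
def winner(prefix, cycle, priority):
    """Winner of the ultimately periodic play prefix . cycle^omega.

    The priorities occurring infinitely often are exactly those of the
    nodes on the cycle, so the winner is determined by the parity of
    the highest priority on the cycle; the finite prefix is irrelevant.
    """
    top = max(priority[v] for v in cycle)
    return 0 if top % 2 == 0 else 1

# Hypothetical priorities for nodes a..d.
priority = {"a": 3, "b": 5, "c": 2, "d": 4}
print(winner(["a"], ["b", "c"], priority))  # highest recurring priority is 5 (odd): Player 1
print(winner([], ["c", "d"], priority))     # highest recurring priority is 4 (even): Player 0
```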

Somewhat surprisingly, the best performing algorithm to solve parity games in practice is Zielonka’s Recursive Algorithm (ZRA [18]), which runs in time exponential in the number of priorities, itself bounded by |V|. This algorithm is rather simple, and mostly relies on the computation of attractor sets, which are sets of vertices \(A = Attr_q(U)\) inductively defined for each Player q as shown below – and used to compute both \(W_0\) and \(W_1\) recursively. Formally, the attractor function \(Attr_q: \mathcal {P}(V) \rightarrow \mathcal {P}(V)\) for Player q computes the attractor set of a given set of vertices \(U\subseteq V\), and is defined inductively as follows:

$$\begin{aligned} Attr_q^0 (U) =&\ U \\ Attr_q^{n+1} (U) =&\ Attr_q^{n} (U) \\&\ \cup \{u \in V_q \ | \ \exists v \in Attr_q^{n} (U): (u,v)\in E \} \\&\ \cup \{u \in V_{1-q} \ | \ \forall v \in V: (u,v)\in E \Rightarrow v \in Attr_q^{n} (U) \} \\ Attr_q (U) =&\ Attr_q^{|V|} (U) \end{aligned}$$

As shown in Algorithm 1, ZRA [18] finds disjoint sets of vertices \(W_0\)/\(W_1\) from which Player 0/1 has a winning strategy. Through the computation of attractor sets, the algorithm works by recursively decomposing the graph, finding sets of nodes that could be forced towards the highest priority node(s), and hence building the winning regions \(W_0\) and \(W_1\) for each player in the game.

[Algorithm 1: Zielonka’s Recursive Algorithm (ZRA)]
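Since the listing of Algorithm 1 is not reproduced here, the following Python sketch gives a standard set-based formulation of ZRA together with its attractor subroutine; it may differ in minor details from the pseudocode of [18], and the example game at the end is made up for illustration:

```python
def attr(q, U, nodes, owner, succ):
    """Attractor Attr_q(U) within the subgame induced by `nodes`."""
    A = set(U)
    changed = True
    while changed:
        changed = False
        for v in nodes - A:
            sucs = [w for w in succ[v] if w in nodes]
            if (owner[v] == q and any(w in A for w in sucs)) or \
               (owner[v] != q and all(w in A for w in sucs)):
                A.add(v)            # v can be forced into U by Player q
                changed = True
    return A

def zielonka(nodes, owner, succ, priority):
    """Winning regions (W0, W1) of the subgame induced by `nodes`."""
    if not nodes:
        return set(), set()
    p = max(priority[v] for v in nodes)
    q = p % 2                       # player favoured by the top priority
    U = {v for v in nodes if priority[v] == p}
    A = attr(q, U, nodes, owner, succ)
    W = list(zielonka(nodes - A, owner, succ, priority))
    if not W[1 - q]:
        W[q], W[1 - q] = set(nodes), set()   # Player q wins the whole subgame
    else:
        B = attr(1 - q, W[1 - q], nodes, owner, succ)
        W2 = list(zielonka(nodes - B, owner, succ, priority))
        W[q], W[1 - q] = W2[q], W2[1 - q] | B
    return W[0], W[1]

# Example: a two-node cycle with priorities 1 and 2; the only play sees
# priority 2 infinitely often, so Player 0 wins from both nodes.
owner = {0: 0, 1: 1}
succ = {0: {1}, 1: {0}}
priority = {0: 1, 1: 2}
print(zielonka({0, 1}, owner, succ, priority))
```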

3 A matrix-based approach

Experimental results from [7] motivated us to investigate whether ZRA can be improved in practice, since this algorithm shows the best performance both on random games and on several structured games found in practical settings. This finding is complemented by the observation made in [6] that, when running ZRA, most of the time is spent in the computation of attractor sets: about 99% as reported in [6] (with experiments considering random games only), and about 77% in our study (which considers larger classes of games).

Our observation, and working hypothesis, not found in previous work [6, 7], is that the basic ZRA can be highly optimised in practice if its main computational component – the attractor set subroutine – is accelerated using efficient techniques for matrix manipulation, by encoding the attractor computation as operations on matrices. This is precisely what we do in this section, and it is what makes our approach particularly well suited to a parallel implementation using modern GPU technologies for efficient matrix manipulation.

To achieve a matrix-based encoding of ZRA, and in particular of its attractor set subroutine, we redefine the representation of the graph in terms of a sparse adjacency matrix A, a vector \(\textbf{o}\) defining the ownership of every node, and a vector \(\boldsymbol{\omega }\) defining the priority of every node. Due to the potentially high computational cost of copying A, we maintain a vector \({\textbf {g}}\) representing which nodes are still included in the game (the subgame being computed at that point in the algorithm), which is copied and updated as Zielonka’s algorithm recurses and decomposes the graph into ever smaller parts. With these structures, we can compute \({\textbf {d}} = A{\textbf {g}}\), a vector containing the out-degree of every node within the current subgame. More specifically:

  • \((A)_{i j} = 1, \text { if there is an edge from } i \text { to } j; (A)_{i j} = 0, \text { otherwise}\);

  • \((\textbf{o})_i = q, \text { if node } i \text { belongs to player }q\);

  • \((\boldsymbol{\omega })_i = \varOmega (i)\), the priority of node i;

  • \((\textbf{g})_i = 1, \text { if node } i \text { is in the game}; (\textbf{g})_i = 0, \text { otherwise}\).
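As an illustration, this encoding can be sketched as follows, here in Python with NumPy and a dense matrix for brevity (the actual implementation is in C++ with sparse matrices); the 5-node game, its ownership, and its priorities are hypothetical:

```python
import numpy as np

# Hypothetical 5-node game; edge list, owners, and priorities are made up.
edges = [(0, 1), (1, 0), (2, 0), (2, 1), (3, 0), (3, 4), (4, 1)]
n = 5
A = np.zeros((n, n), dtype=np.float32)
for i, j in edges:
    A[i, j] = 1.0                  # (A)_ij = 1 iff there is an edge i -> j
o = np.array([0, 0, 1, 1, 0], dtype=np.float32)   # ownership vector
w = np.array([7, 6, 3, 2, 5], dtype=np.float32)   # priority vector
g = np.ones(n, dtype=np.float32)                  # every node is in the game

d = A @ g                          # out-degree of each node in the subgame
print(d)                           # [1. 1. 2. 2. 1.]

# Removing node 4 from the subgame needs no copy of A: clear g[4], and
# the product A @ g already reflects the smaller game.
g[4] = 0.0
print(A @ g)                       # [1. 1. 2. 1. 1.]
```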

With these definitions in place, we can make the necessary modifications to the attractor function presented before – see Algorithm 2. The input/output vector \(\textbf{t}\) contains 1 at position \((\textbf{t})_i\) if node i is part of the attractor set, and 0 otherwise. All operations are vectorised: comparisons between two vectors are done element-wise, and if a vector is compared to a scalar s, the scalar is implicitly converted to \(\textbf{s} = s \textbf{1}\). The \(\odot \) operator denotes the Hadamard (element-wise) product, which is used primarily as a Boolean And operation. The argument q is the player: 0 for Player 0 and 1 for Player 1.

[Algorithm 2: Matrix-based attractor computation]

This algorithm works by first finding the number of outbound edges each node has (\(\textbf{d} \leftarrow A \textbf{g}\)), and at each iteration finding how many ways each node can enter the attractor set (\(\textbf{v} \leftarrow A \textbf{t}\)). It then finds nodes that q owns that may enter the attractor set (\((\textbf{o} = q) \odot (\textbf{v} > 0)\)), and nodes that q does not own that are forced to enter the attractor set (\((\textbf{o} = (1-q)) \odot (\textbf{v} = \textbf{d})\)). It then filters the nodes to include in the attractor set depending on which nodes are still included in the subgraph (\(\textbf{g} \odot (\cdots )\)), and breaks the loop when there is no difference between \(\textbf{t}\) and \(\textbf{t}'\). To illustrate this procedure, take as an example the graph below.

For this example, assume that \({\textbf {g}} = {\textbf {1}}\) and that we are computing the attractor set for the player that owns the circle nodes, starting from the node with priority 7. After one (or, in general, some number of) iteration(s), the state shown is reached. Green nodes denote nodes included in the previous iteration’s attractor set, and yellow nodes denote nodes that will be included in this iteration. The calculations performed are as follows. Define the adjacency matrix of the graph (A), the nodes currently included in the attractor set, \( \textbf{t} = \left( 1 \ 1 \ 0 \ 0 \ 0 \right) ^\top \), the ownership of every node, \( \textbf{o} = \left( 0 \ 0 \ 1 \ 1 \ 0 \right) ^\top \), and the degree – number of outbound edges – of every node, \( \textbf{d} = A \textbf{g} = \left( 1 \ 1 \ 2 \ 2 \ 1 \right) ^\top \). Now, compute the number of edges from each node leading to an element of the current attractor set, that is, \( \textbf{v} = A \textbf{t} = \left( 1 \ 1 \ 2 \ 1 \ 1 \right) ^\top \), and with that, update \(\textbf{t}\) to obtain \( \textbf{t} \leftarrow \left( 1 \ 1 \ 1 \ 0 \ 1 \right) ^\top \), which exactly represents the value of the attractor function one step later. Similar changes in the representation of the game must also be made to ZRA itself, so that it becomes, fully, a matrix manipulation algorithm (Algorithm 3).
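This iteration can be reproduced numerically. In the NumPy sketch below, the adjacency matrix is a reconstruction consistent with the vectors of the example (the figure itself is not reproduced here, so the exact edge set is an assumption), and NumPy stands in for the C++/CUDA implementation:

```python
import numpy as np

# Reconstructed adjacency matrix consistent with d, v, and t above.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0]], dtype=np.float32)
o = np.array([0, 0, 1, 1, 0], dtype=np.float32)   # ownership, as in the text
g = np.ones(5, dtype=np.float32)                  # all nodes in the game
t = np.array([1, 1, 0, 0, 0], dtype=np.float32)   # current attractor set
q = 0                                             # attracting player

d = A @ g                              # out-degrees: [1, 1, 2, 2, 1]
v = A @ t                              # edges into the attractor: [1, 1, 2, 1, 1]
friendly = (o == q) & (v > 0)          # player-q nodes with a way in
forced = (o == 1 - q) & (v == d)       # opponent nodes with no way out
t = g * np.maximum(t, (friendly | forced).astype(np.float32))
print(t)                               # [1. 1. 1. 0. 1.]
```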

[Algorithm 3: Matrix-based ZRA (MatZielonka)]

The correctness of the algorithm carries over directly from ZRA, since our encoding into matrix operations is functional. Less clear is whether our algorithm retains ZRA’s complexity, since a functional mapping does not necessarily imply that the encoding (our representation) has the same complexity as the encoded instance (i.e., the original problem). We study this question next.

3.1 Complexity

Using the algorithms defined before, we derive a function R(d, n) that bounds the maximum number of recursive calls made by ZRA, given d distinct priorities and n nodes: \(R(d,n) = 1 + R(d-1,n-1) + R(d,n-1)\). The 1 accounts for the original call; the first recursive call is made with at least the vertex with the largest priority removed, and the second is made with at least one vertex removed. Hence the construction above. There are two base cases, \(R(d,0) = R(0,n) = 1\). Firstly, we observe that, based on the algorithms herein defined, we get:

$$\begin{aligned} R(d,n)&= 1 + R(d-1,n-1) + R(d,n-1) \\&= (n+1) + \sum _{i=1}^n[R(d-1,n-i)] \end{aligned}$$

We claim that R(d, n) is given in closed form by \(f(d,n) = 2\sum _{j=0}^d{n \atopwithdelims ()j} - 1\). For the base case, when \(d=1\), we note that \(R(1,n) = (n+1) + \sum _{i=1}^n[R(0,n-i)] = 2n + 1\) and \(f(1,n) = 2\sum _{j=0}^1{n\atopwithdelims ()j} - 1 = 2(n + 1) - 1 = 2n + 1= R(1,n)\), as required, for all n. For the inductive case, assume that \(R(d,n) = f(d,n)\) holds for \(d=k\) and all n.

$$\begin{aligned} R(k+1,n)&= (n+1) + \sum _{i=1}^n[R(k, n-i)] \\&= (n+1) + \sum _{i=1}^n[f(k,n-i)] \\&= 1 + 2\sum _{i=1}^n\sum _{j=0}^k{n-i \atopwithdelims ()j} = 2\sum _{j=0}^{k+1}{n \atopwithdelims ()j} - 1 = f(k+1, n) \end{aligned}$$

Hence, the statement is true for the base case \(d=1\) and all n, while the case \(d=k\) implies \(d=k+1\). Thus, by induction, \(R(d,n)=f(d,n)\) for \(d\ge 1\) and all n. We now observe that the worst-case number of calls occurs, as expected, at \(d=n\), where \(R(n,n)=2^{n+1}-1\). Note that a single call to MatZielonka has time complexity \(O(n^3)\) (dominated by the calls to the matrix-based Attr subroutine) and space complexity O(n), delivering worst-case complexities of \(O(n^3\cdot 2^{n})\) time and \(O(n\cdot 2^{n})\) space.
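The recurrence and its closed form can also be cross-checked mechanically; a small Python sketch:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def R(d, n):
    """Bound on the number of recursive calls with d priorities and n nodes."""
    if d == 0 or n == 0:
        return 1
    return 1 + R(d - 1, n - 1) + R(d, n - 1)

def f(d, n):
    """Closed form: 2 * sum_{j=0}^{d} C(n, j) - 1."""
    return 2 * sum(comb(n, j) for j in range(d + 1)) - 1

# The recurrence matches the closed form, and the worst case d = n
# yields 2^{n+1} - 1 calls.
assert all(R(d, n) == f(d, n) for n in range(12) for d in range(1, n + 1))
assert R(10, 10) == 2**11 - 1
print("closed form verified for n < 12")
```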

This result, negative in theory, is consistent with the worst-case complexity of ZRA, which indicates that our matrix-based encoding retains the complexity properties of the original algorithm. More interestingly, the quasi-polynomial extension of ZRA by Parys [16], later improved by Lehtinen et al. [13], can also be handled with our approach while retaining its quasi-polynomial complexity. However, a matrix-based extension of the latter algorithm was not evaluated; its practical usefulness is therefore yet to be studied.

4 Implementation and evaluation

Several factors influence the practical performance of a computational solution to a problem: for instance, (1) the algorithm used to solve the problem, (2) the programming language used to implement the solution, (3) the concrete data structures used to represent it, and (4) the hardware where the solution is deployed. Our solution tries to optimise (1)–(4) using both lessons learnt from previous research and properties of our own matrix-based approach. Details are given later but, in short, in this section five parity game solvers are implemented and evaluated:

  I1.

    our basic matrix-based approach, presented in the previous section;

  I2.

    its parallel implementation for deployment using GPU technologies;

  I3.

    the improved implementation of the attractor function of ZRA in [6];

  I4.

    the highly optimised C++ implementation of ZRA presented in [7];

  I5.

    the unoptimised version of the above algorithm, also in [7].

Apart from factor (2), the five implementations above (I1–I5) allow us to carry out a comprehensive evaluation of our approach, both against different versions of our own work and against previous research. Indeed, the only aspect that all the solutions presented in this section have in common is the programming language used for implementation, C++, at present the language offering the most efficient practical implementations of parity game solvers; cf. [6, 7, 9, 17]. We first present the characteristics of our matrix-based approach, deployed both as a sequential algorithm and as a parallelised procedure. After that, we describe key features of the solutions originally developed elsewhere, and continue with the results of the evaluation using different types of parity games.

Matrix-based approach. Whilst it is important to gain performance from parallelisable operations, it is equally important to avoid losing performance through inefficient or slow operations. Specific algorithmic design choices, such as maintaining a vector \(\textbf{g}\) to track which nodes are in or out of the graph, are made to avoid otherwise necessary operations such as copying the adjacency matrix, which would be slow, especially when solving very large games.

Additionally, all values in vectors and matrices are stored as single-precision floating point values in practice. This is due to the software limitations of the Compute Unified Device Architecture (CUDA) [15] library, which are likely limitations of the underlying hardware itself. In particular, this limits the maximum out-degree of a node to \(2^{24}\), which corresponds to the number of bits in the mantissa of a single-precision floating point number (23), plus one. Beyond this limit, the accuracy of the values computed in operations such as obtaining the out-degree of a node with \(A \textbf{g}\) would no longer be guaranteed, and with it the correctness of the algorithm. We note that this limitation may be overcome by splitting a single node into multiple nodes, thus curbing the maximum out-degree to an acceptable range. We do not do this in these experiments as this transformation has unknown impacts on the performance of the algorithm.
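The \(2^{24}\) limit is easy to confirm in isolation: above it, consecutive integers are no longer representable in single precision, so counts accumulated in float32 can silently lose accuracy.

```python
import numpy as np

# float32 has a 23-bit mantissa, so consecutive integers are exactly
# representable only up to 2^24; beyond that, degree counts computed
# via A @ g could silently lose precision.
x = np.float32(2**24)
print(x + np.float32(1) == x)                        # True: 2^24 + 1 rounds back to 2^24
print(np.float32(2**24 - 1) + np.float32(1) == x)    # True: still exact below the limit
```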

[Algorithm 4: Batched parallel attractor computation]

The invocation of a function that runs on the GPU (known as a kernel) has an overhead, with the overhead duration varying somewhat between devices. As a consequence, tuning for a particular problem depends on the functions being executed and on the GPU itself; the overheads result in periods where the device is idle. Note also that, in practice, it is usually faster to queue multiple iterations of the attractor computation at once, since performing an extra iteration when the full attractor set has already been computed does not alter the result (Algorithm 4), and queueing multiple kernel invocations has the same overhead as calling one kernel alone. The main difference between our sequential and parallel implementations of the matrix-based method is the function computing attractor sets, which is as in Algorithm 2 in the sequential case, and as in Algorithm 4 in the parallel case. The code in \(\ldots \) is the same in both implementations, and the key difference is that we set the parallel implementation to make 3 kernel invocations per execution of the attractor function – which in lucky cases may require only 1 kernel invocation, while in unlucky cases may require more than 3 kernel invocations, increasing overheads; for our problem, we found that 3 kernel invocations was appropriate.

We find another possible point of optimisation, as the time taken for the attractor computation is approximately \(c t_c + n t_o\), where c is the number of attractor computations (the inside section of the for loop), n is the number of times the outer while loop runs, \(t_c\) is the time to run the for loop once, and \(t_o\) is the overhead incurred by switching execution from device (GPU) to host (CPU) when the condition of the while loop is checked. Ideally, \(c=C+1\) and \(n=1\), where C is the (unknown) number of attractor computations required. Our implementation runs the inner for loop a constant number of times (3 here). As such, \(C + 1 \le c \le C + 3\), and \(n = \lceil \frac{C}{3} \rceil \).
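The batching scheme can be modelled abstractly as a fixed-point iteration whose convergence is only checked between batches of k steps, exploiting the fact that extra iterations past the fixed point are harmless. The sketch below is a toy host-side model (not the CUDA code), with a made-up monotone step function standing in for one attractor kernel:

```python
def batched_fixpoint(step, t, k=3):
    """Iterate `step` to a fixed point in batches of k iterations,
    checking convergence only between batches (one host/device sync
    per batch, modelling queued kernel invocations)."""
    iters = syncs = 0
    while True:
        prev = t
        for _ in range(k):          # k queued kernel invocations
            t = step(t)
            iters += 1
        syncs += 1                  # host-side convergence check
        if t == prev:               # an entire batch changed nothing
            return t, iters, syncs

# Toy monotone step: attract along a chain 5 -> 4 -> ... -> 0,
# one node per iteration (5 productive iterations are needed).
step = lambda s: s | frozenset(x - 1 for x in s if x > 0)
result, iters, syncs = batched_fixpoint(step, frozenset({5}))
print(sorted(result), iters, syncs)   # all six nodes; 9 iterations, 3 syncs
```

The extra, non-productive iterations inside a batch cost device time \(t_c\) but save host/device synchronisations \(t_o\), which is exactly the trade-off described above.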

Importantly, the requirements for efficient parallelisation on the GPU led us to select the ‘Naive attractor’ implementation (Algorithm 2) as the underlying algorithm to be parallelised (leading to Algorithm 4), rather than the ‘Improved attractor’ implementation in [6]. The concepts of ‘Naive’ and ‘Improved’ attractors are presented by Arcucci et al. in [6]. In short, the ‘Naive’ attractor loops over each node and checks whether it can be included in the attractor set, repeating this until no further nodes can be added. The ‘Improved’ attractor starts from the original attractor set and propagates backwards along inbound edges to find other nodes that may be included in the set.

GPU deployment. Our GPU implementation works by parallelising the “attract” operation. The sequential version may be executed as follows:

  • (Loop 1) While attracting new nodes...

    • (Loop 2) For each node, check if it can be included in the attractor set.

And the runtime operations may look like:

  • While attracting new nodes...

    • Can node 1 be included in the attractor set?

    • ...

    • Can node N be included in the attractor set?

  • If attracted new nodes, repeat loop. Else break.

Performance is gained by efficiently parallelising the inner loop on the GPU. Additional specifics include the following GPU deployment features. When asking “Can node X be included ...?”, the computation taking place is:

  • Let J be the set of nodes in the current attractor set.

  • Let K be the set of nodes that X can move to.

  • If X is on the “friendly” team, and \(K\cap J \ne \emptyset \), then \(J \leftarrow J \cup \{X\}\).

  • If X is on the “enemy” team, and \(K \subseteq J\), then \(J \leftarrow J \cup \{X\}\).

Key to our approach is that these operations are efficiently parallelised by means of matrix multiplication operations on the GPU. This is done as follows:

  • Let A be the adjacency matrix (usually, a sparse matrix) of the parity game.

  • Compute \(\textbf{t}=A\textbf{1}\). Hence, \(\textbf{t}_i\) is the number of nodes node i can move to.

  • Let \(\textbf{j}\) be a vector of size N (where N is the size of the parity game), such that \(\textbf{j}_i=1\) if and only if node i is in the current attractor set, and \(\textbf{j}_i=0\) otherwise.

  • Compute the vector \(\textbf{k}=A\textbf{j}\). Hence, \(\textbf{k}_i\) is the number of nodes that node i can move to and that are in the current attractor set.

  • Then, for each node i, if it is on the friendly team, and \(\textbf{k}_i \ne 0\), then \(\textbf{j}_i = 1\); otherwise, if it is on the enemy team, and \(\textbf{k}_i=\textbf{t}_i\), then \(\textbf{j}_i=1\).

Note that we convert the previous set-based logic to the vector form as follows:

  • \(K \cap J \ne \emptyset \Leftrightarrow \textbf{k}_i \ne 0\) and \(K \subseteq J \Leftrightarrow \textbf{k}_i=\textbf{t}_i\).
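These two equivalences can be checked directly on a randomly generated game; the NumPy sketch below is illustrative only (the real implementation performs the vector computations on the GPU), and the game itself is random:

```python
import numpy as np

def equivalences_hold(A, j):
    """Check K∩J≠∅ ⇔ k_i≠0 and K⊆J ⇔ k_i=t_i for every node i,
    where K is the successor set of node i and J the attractor set."""
    N = A.shape[0]
    t = A @ np.ones(N, dtype=np.float32)   # t_i = |K| for node i
    k = A @ j                              # k_i = |K ∩ J| for node i
    J = {m for m in range(N) if j[m] == 1}
    for i in range(N):
        K = {m for m in range(N) if A[i, m] == 1}
        if (len(K & J) > 0) != bool(k[i] != 0):
            return False
        if K.issubset(J) != bool(k[i] == t[i]):
            return False
    return True

rng = np.random.default_rng(0)
N = 8
A = (rng.random((N, N)) < 0.4).astype(np.float32)
A[np.arange(N), np.arange(N)] = 1.0        # ensure every node has a successor
j = (rng.random(N) < 0.5).astype(np.float32)
print(equivalences_hold(A, j))             # True
```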

Improved attractor implementation by Arcucci et al. [6]. The third parity game solver we evaluate is a custom C++ implementation of ZRA using the ‘Improved attractor’ algorithm in [6], originally implemented there in Java.

ZRA implementations in Oink [7]. The fourth and fifth implementations we evaluate and compare against are the most highly optimised implementation of ZRA developed in [7], and its unoptimised version – without pre-processing routines. We include the latter since our matrix-based (‘Naive’) implementation is likewise not optimised with pre-processing routines. These solvers in Oink are referred to as zlk and uzlk in [7]. We note that the parallel implementation of this algorithm is not included since [7] shows that it is usually outperformed by zlk, which we do include here.

4.1 Evaluation

The implementations evaluated in this paper were tested on a wide repository of parity games, and against state-of-the-art parity game solvers in the literature. The games used for performance evaluation include the suite by Keiren [12] (games representing model checking and equivalence checking problems) and an additional set of variably sized random games generated by PGSolver [9].

We evaluate the performance of each solver on each of the games in terms of solve time. As is common practice when evaluating parity game solvers, the overheads incurred by startup and game loading are not included; this is done in order to obtain numbers that estimate only the running time of the algorithms, and nothing else. With the same aim, we ensured that at most one solver was running at any time, with CPU utilisation not exceeding one core. Finally, in order to allow for a fair comparison of running times only – rather than conflating such results with the robustness of the algorithms – we measured the time to solve an instance only when all implementations successfully computed a solution. This allows for a fairer comparison with respect to runtime performance, because failing a game usually implies a disproportionately high (and arbitrary) runtime. Such failures include timeouts (at 5 minutes) or being unable to load the game, sometimes due to factors having little to do with the running time of the algorithms. Our experiments were conducted on the Google Cloud Platform (GCP) using a T4 n1-highmem-2.

Profile of the input parity games. Our study includes more than 2000 parity games, with sizes ranging from only a few dozen states to millions of states. Both the nodes’ out-degrees and the number of distinct priorities also cover a wide range. However, both random games and structured ones (model checking and equivalence checking) are typically represented by sparse graphs, a feature that we leverage for implementation purposes.

5 Analysis of results

As can be seen from Tables 1, 2, and 3, we evaluate the main five implementations, all of them following the ZRA philosophy, using two types of parity games: structured and random. Both types of benchmarks are as in [7] and [6], arguably the two best implementations of ZRA. The focus of this evaluation is to understand the usefulness and scalability of the ‘GPU matrix’ algorithm, which is the one that most cleanly embodies our working hypothesis, namely, that the combination of a matrix-based representation of ZRA and the use of modern GPU technologies can outperform the state of the art in the design of algorithms for parity games – a hypothesis for which we provide strong evidence here.

Table 1. Times are in milliseconds (ms) representing the average time taken to solve games that all implementations passed (i.e., if any implementation fails to solve a game, the game is excluded from the time average of all five solvers, including an additional GPU implementation on an RTX2060S, presented later). Failures occur with a small number of large equivalence checking games only. Failures include a few timeouts (at 5 mins), and usually being unable to load the game in memory due to hardware limitations posed by the GPU architectures. Columns P/F show the number of games passed/failed for every type of game.
Table 2. Results in this table are formatted as in Table 1. In this table, we report the performance (average time in milliseconds taken to solve a single game) for the 5 algorithms on large (>1M nodes) parity games only.
Table 3. Results in this table are formatted as in Table 1. In this table, we report the performance (average time in milliseconds taken to solve a single game) for the 5 algorithms on “small” (<1M nodes) parity games only: results for structured and random games appear in the top table and for random games (detailed) at the bottom. In the bottom table, there are 200 games per column, apart from column 640K which has 100 games; there are no failures.

The results above also show that going from the sequential version of our approach (‘Naive (matrix) attractor’) to its parallel implementation using GPU technologies yields significant improvements. These two main “internal” results are then compared with the state of the art in the algorithmic design of solutions based on ZRA, namely, the improved attractor in [6] and the highly optimised procedure zlk in Oink [7], which even outperforms its own parallel implementation; cf. [7]. Finally, the unoptimised Oink version of this procedure, uzlk, is also included, simply because our matrix-based procedure does not contain any of the pre-processing routines that differentiate zlk from uzlk. Thus, in a way, uzlk provides results for a somewhat fairer comparison.

GPU matrix vs Naive (matrix) attractor. Results in all tables show that the parallel implementation using GPU technologies outperforms its own sequential implementation (‘Naive matrix attractor’) by a substantial margin, with some exceptions, usually ranging from 5 times faster in some cases (e.g., model checking of large games) to more than 10 times faster (e.g., model checking of small games). This, we believe, is due to the fact that the bigger the input instances to be analysed, the more the overheads associated with running the procedure in parallel are compensated later on. A trend in that direction can be observed in detail when comparing the performance of these two algorithms over small random games. In any case, our parallel matrix-based approach is always at least as fast as its sequential implementation.

GPU matrix vs Improved attractor. The results show that the parallel matrix-based approach can outperform the improved attractor procedure by Arcucci et al. [6] by 2–7 orders of magnitude, depending on the type of game being solved, with the best results obtained when solving random games, whether large or small. However, the sequential version of ‘GPU matrix’, that is, the Naive implementation, is usually about twice as slow as the improved attractor implementation on structured games. Conversely, even the (sequential) Naive implementation of the matrix-based method outperforms the improved attractor procedure over random games, by about 30% overall. Looking at all the tables of results together, this indicates that the improved attractor approach performs somewhat poorly over random graphs, at least when compared to its performance over structured games.

GPU matrix vs Oink. Even though the GPU matrix-based implementation outperforms Oink’s zlk, it usually does so only by a factor of 1.5 to 2.0, with the GPU implementation performing more efficiently over (large) random games than over structured ones. This result actually speaks very highly of the optimised sequential implementation of ZRA. However, as shown in [7], zlk performs even better than its own parallel implementation (called zlk-8 in [7]) when solving model checking parity games (by a very small margin) and when solving random games, where it is nearly twice as fast; cf. Table 3 of [7]. Only when solving equivalence checking parity games does zlk-8 outperform zlk, and only by about a 13% margin. In contrast, the GPU implementation presented here outperforms zlk by more than a 70% margin, and is even twice as fast when solving small equivalence checking games.

However, as we can see from all tables, the GPU matrix-based implementation has some failures (timeouts, or failures to load the game into memory, mainly due to game size), while the improved attractor method never fails on the considered set of benchmarks. This indicates that, in this particular case, there may be a choice to be made between the potentially marginal gain in efficiency of the GPU approach and the greater reliability offered by zlk. On the other hand, zlk clearly outperforms the sequential (Naive) implementation of the matrix-based approach, ranging from about twice as fast when solving random games to about four times as fast when solving structured games. Regarding performance against Oink’s uzlk, all the analyses above remain similar, except that the factor in favour of the GPU matrix-based approach is usually larger.

Improved attractor vs Oink’s zlk. Although these two procedures predate our work, we comment on their comparative performance for the sake of completeness. As can be seen from our results, both offer the same reliability, as neither fails to solve any instance. Regarding runtime efficiency, we observe that, on average, Oink’s zlk implementation tends to be 1.5 to 3.0 times faster than the improved attractor method, with the worst/best comparative performance observed when solving model checking/random parity game instances. This makes zlk perhaps the most efficient sequential implementation of ZRA currently available in the literature, outperformed only when a parallel approach is considered.

6 Special cases

In this section, we analyse in more detail two special cases of our results: performance when solving large parity games and performance on random games.

6.1 Solving large parity games

For the purposes of this section, a large parity game is a game with more than 1 million nodes. Our results show that for games that are not large (Table 3), all solvers may be regarded as running efficiently from a human perspective: some random games with more than 500K nodes are solved in about half a second by the slowest implementation on random games (the improved attractor), and in most other instances solutions are obtained in just a few milliseconds. For instance, the model checking parity games in the suite of benchmarks can be solved in less than 0.1 minutes by any studied solver, and even in less than 10 milliseconds on average using the parallel GPU matrix-based approach, with the Oink implementation taking virtually the same time (just a little more than 10 milliseconds on average). Thus, the real challenge when solving parity games in practice is solving large parity games, where the relative performance of the different solvers is much better exposed (Table 2).

Our results show (Tables 1 and 2) that, despite raw running times differing by about a factor of 9, nearly the same relative performance is obtained over all games as over large games only, even though the latter account for no more than 15% of the equivalence checking games, 10% of the model checking games, and less than 5% of the random games. This indicates that, to evaluate the performance of parity game solvers in practice, one should focus on large games. As the data shows, in that case the parallel GPU matrix-based approach outperforms the second-best technique by a factor of approximately 1.5 to 2.0, and its own sequential implementation by a factor of 4 to 5, in each case depending on the type of parity game under consideration. The analysis holds across all solvers.

6.2 Solving random parity games

Random parity games are a common benchmark for parity game solvers, and are the focus of the study in [6]. Our detailed experiments on random parity games show that the parallel GPU implementation of the matrix-based approach is comparable to the parallel implementation of the improved attractor in [6] (see Table 3 there), in the sense that a similar relative gain in performance is achieved overall, about 3.5-4.0 times faster over random games of up to 20K nodes. The gain in performance increases in our case when considering larger random games, perhaps indicating that our approach is more scalable in terms of running time; however, [6] only presents results on random games of up to 20K nodes. We note that merely changing the programming language (Java in [6], C++ here) already improves performance considerably: random games with 20K nodes are solved in more than 5 seconds there, but in just 7 ms on average here.

7 Alternative implementations

In this section, we explore two alternative implementations, one based on a change of programming environment and the other on a change of computer architecture. Our results show that while the former is clearly outperformed by the original C++ implementation, the latter shows that even better performance than already reported can be achieved when using other GPU technologies.

A MATLAB implementation. Given its facilities for matrix operations, we investigated a MATLAB implementation of our matrix-based approach to understand whether it could perform better than our original C++ implementation. The results were negative: the MATLAB implementation, although simple, performed significantly worse than the other methods, including our own C++ implementation. A summary of the results, which require little discussion, can be found in Table 4.

Table 4. Results in this table are formatted as in Table 1. We report results on all games, excluding, in each case independently, the time of unsolved instances.

Using a different GPU technology. We conducted experiments using the exact same implementation of the GPU matrix solver (originally run on GCP) on a different GPU architecture, namely, an RTX 2060 Super (with a Ryzen 5 3600 CPU). We found that simply changing to this alternative hardware specification made the results on all types of games significantly better, as shown in Table 5.

Table 5. Results in this table are formatted as in Table 1. We report results on all games, which show an improvement by a factor of 1.5 on structured games, while performing approximately 25% slower on random parity games.

8 Concluding remarks and related work

We have shown that a new matrix-based method for solving parity games can outperform the state-of-the-art techniques, both sequential and parallel, currently available. As such, our results become a new point of comparison when evaluating modern parity game solvers. Previous research [6, 7, 9, 17] has shown that ZRA is potentially the best performing algorithm for solving parity games in practice, and here we provide further evidence that this is indeed the case. We also give evidence that C++ implementations of this task are hardly ever outperformed in practice. Finally, we show that choosing the right computer architecture is key to achieving optimal performance; in the case of modern GPU technologies, such a choice can make a significant difference in practice, leading in our study to the development of, as of today, the most efficient parallel solver for parity games.