
1 Introduction

Hyperproperties [16] are system properties that relate multiple executions of a system. Such properties are of increasing importance as they naturally occur, e.g., in information-flow control [36], robustness [22], linearizability [30, 31], path planning [39], mutation testing [27], and causality checking [18]. A prominent logic to express hyperproperties is HyperLTL, which extends linear-time temporal logic (LTL) with explicit trace quantification [15]. HyperLTL can, for instance, express generalized non-interference (GNI) [34], stating that the high-security input of a system does not influence the observable output.

$$\begin{aligned} \varphi _{ GNI } := \forall \pi . \forall \pi '. \exists \pi ''.\ \mathbf {G} \Big (\bigwedge _{a \in H} a_\pi \leftrightarrow a_{\pi ''}\Big ) \wedge \mathbf {G} \Big (\bigwedge _{a \in L \cup O} a_{\pi '} \leftrightarrow a_{\pi ''}\Big ) \end{aligned}$$

Here, H is a set of high-security inputs, L is a set of low-security inputs, and O is a set of low-security outputs. The formula states that for any traces \(\pi , \pi '\) there exists a third trace \(\pi ''\) that agrees with the high-security inputs of \(\pi \) and with the low-security inputs and outputs of \(\pi '\). Any observation made by a low-security attacker is thus compatible with every possible high-security input.
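To make the quantifier structure concrete, the condition can be sketched over finite trace prefixes. This is an illustration only (`agrees` and `check_gni` are hypothetical names, and actual GNI ranges over infinite traces):

```python
# Illustration of GNI over *finite* trace prefixes. A trace is a tuple of
# sets of atomic propositions; H, L, O are the proposition sets from above.

def agrees(t1, t2, props):
    """Do two traces agree on the given propositions at every step?"""
    return all(s1 & props == s2 & props for s1, s2 in zip(t1, t2))

def check_gni(traces, H, L, O):
    """For all pi, pi', there must exist pi'' that matches pi on the
    high-security inputs H and pi' on the low-security inputs/outputs."""
    return all(
        any(agrees(t, t3, H) and agrees(t2, t3, L | O) for t3 in traces)
        for t in traces
        for t2 in traces
    )
```

A system whose output depends only on the low-security input passes the check, while a system that copies the high-security input to the output does not.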

We are interested in the model checking (MC) problem of HyperLTL, i.e., whether a given (finite-state) system satisfies a given property. For HyperLTL, the structure of the quantifier prefix directly impacts the complexity of this problem. For alternation-free formulas (i.e., formulas that only use quantifiers of a single type), verification is well understood and is reducible to the verification of a trace property on a self-composition of the system [3]. This reduction has, for example, been implemented in MCHyper [29], a tool that can model check (alternation-free) HyperLTL formulas in systems of considerable size (circuits with thousands of latches).

Verification is much more challenging for properties involving quantifier alternations (such as GNI from above). While MC algorithms supporting full HyperLTL exist (see [15, 29]), they have not been implemented yet. Instead, over the years, a number of approaches to the verification of such properties in practice have been made: Finkbeiner et al. [29] and D’Argenio et al. [22] manually strengthen properties with quantifier alternation into properties that are alternation-free and can be checked by MCHyper. Coenen et al. [19] instantiate existential quantification in a \(\forall ^*\exists ^*\) property (i.e., a property involving an arbitrary number of universal quantifiers followed by an arbitrary number of existential quantifiers, such as GNI) with an explicit (user-provided) strategy, thus reducing to the verification of an alternation-free formula. Alternatively, the strategy that resolves existential quantification can be automatically synthesized [7]. Hsu et al. [31] present a bounded model checking (BMC) approach for HyperLTL that is implemented in HyperQube. See Section 4 for more details.

While all these verification tools can verify (or refute) interesting properties, they all suffer from the same fundamental limitation: they are incomplete. That is, for all the tools above, we can come up with verification instances where they fail, not because of resource constraints but because of inherent limitations in the underlying verification algorithm. Moreover, such instances are not rare events but are encountered regularly in practice. For example, many of the benchmarks used to evaluate HyperQube (by Hsu et al. [31]) do not admit a strategy to resolve existential quantification. Conversely, many of the properties verified by Coenen et al. [19] (such as GNI) cannot be verified using BMC [31].

AutoHyper. In this paper, we present AutoHyper, a model checker for HyperLTL. Our tool checks a hyperproperty by iteratively eliminating trace quantification using automata complementation, thereby reducing verification to the emptiness check of an automaton [29]. Importantly – and different from previous tools for HyperLTL verification such as MCHyper [19, 29] and HyperQube [31] – AutoHyper can cope with (and is complete for) arbitrary HyperLTL formulas. Model checking using AutoHyper requires no manual effort (such as writing an explicit strategy in MCHyper [19]), nor does a user need to worry about whether the given property can even be verified with a given method. AutoHyper thus provides a “push-button” model checking experience for HyperLTL.Footnote 1

To improve AutoHyper’s efficiency, we make the (theoretical) observation that we can often avoid explicit automaton complementation and instead reduce to a language inclusion check on Büchi automata (cf. Proposition 1). On the practical side, this enables AutoHyper to resort to a range of mature language inclusion checkers, including spot [26], RABIT [17], BAIT [25], and FORKLIFT [24].

Evaluation. Using AutoHyper, we extensively study the practical aspects of model checking HyperLTL properties with quantifier alternations. To evaluate the performance of explicit-state model checking, we apply AutoHyper to a broad range of benchmarks taken from the literature and compare it with existing (incomplete) tools. We make the surprising observation that – at least on the currently available benchmarks – explicit-state MC as implemented in AutoHyper performs on par with (and frequently outperforms) symbolic methods such as BMC [31]. Our benchmarks stem from various areas within computer science, so AutoHyper should – thanks to its “push-button” functionality, completeness, and ease of use – be a valuable addition to many of them.

Apart from using AutoHyper as a practical MC tool, we can also use it as a complete baseline to systematically evaluate existing (incomplete) methods. For example, while it is known that replacing existential quantification with a strategy (as done by Coenen et al. [19]) is incomplete, it was, thus far, unknown if this incompleteness occurs frequently or is merely a rare phenomenon. We use AutoHyper to obtain a ground truth and evaluate the strategy-based verification approach in terms of its effectiveness (i.e., how many instances it can verify despite being incomplete) and efficiency.

Structure. The remainder of this paper is structured as follows. In Section 2, we introduce HyperLTL. We recap automata-based verification (which we abbreviate ABV) and our new approach utilizing language inclusion checks in Section 3. We discuss alternative verification approaches for HyperLTL in Section 4 and give an overview of AutoHyper in Section 5. In Section 6, we compare different backend solving techniques and study the complexity of HyperLTL MC with multiple quantifier alternations in practice. In Section 7, we evaluate ABV on a set of benchmarks from the literature and compare with the bounded model checker HyperQube [31]. In Section 8, we use AutoHyper for a detailed analysis of (and comparison with) strategy-based verification [7, 19].

2 Preliminaries

We fix a set of atomic propositions \( AP \) and define \(\varSigma := 2^ AP \). HyperLTL [15] extends LTL with explicit quantification over traces, thereby lifting it from a logic expressing trace properties to one expressing hyperproperties [16]. Let \(\mathcal {V}\) be a set of trace variables. We define HyperLTL formulas by the following grammar:

$$\begin{aligned} \varphi := \exists \pi .\, \varphi \mid \forall \pi .\, \varphi \mid \psi \qquad \qquad \psi := a_\pi \mid \lnot \psi \mid \psi \wedge \psi \mid \mathbf {X}\, \psi \mid \psi \, \mathbf {U}\, \psi \end{aligned}$$

where \(\pi \in \mathcal {V}\) and \(a \in AP \).

We assume that the formula is closed, i.e., all trace variables that are used in the body are bound by some quantifier. The semantics of HyperLTL is given with respect to a trace assignment \(\varPi : \mathcal {V}\rightharpoonup \varSigma ^\omega \) mapping trace variables to traces. For \(\pi \in \mathcal {V}\) and \(t \in \varSigma ^\omega \), we write \(\varPi [\pi \mapsto t]\) for the trace assignment obtained by updating the value of \(\pi \) to t. Given a set of traces \(\mathbb {T}\subseteq \varSigma ^\omega \), a trace assignment \(\varPi \), and \(i \in \mathbb {N}\), we define:

$$\begin{aligned} \varPi , i \models _\mathbb {T} a_\pi \quad &\text {iff} \quad a \in \varPi (\pi )(i)\\ \varPi , i \models _\mathbb {T} \lnot \psi \quad &\text {iff} \quad \varPi , i \not \models _\mathbb {T} \psi \\ \varPi , i \models _\mathbb {T} \psi _1 \wedge \psi _2 \quad &\text {iff} \quad \varPi , i \models _\mathbb {T} \psi _1 \text { and } \varPi , i \models _\mathbb {T} \psi _2\\ \varPi , i \models _\mathbb {T} \mathbf {X}\, \psi \quad &\text {iff} \quad \varPi , i+1 \models _\mathbb {T} \psi \\ \varPi , i \models _\mathbb {T} \psi _1\, \mathbf {U}\, \psi _2 \quad &\text {iff} \quad \exists j \ge i.\ \varPi , j \models _\mathbb {T} \psi _2 \text { and } \forall i \le k < j.\ \varPi , k \models _\mathbb {T} \psi _1\\ \varPi , i \models _\mathbb {T} \exists \pi .\, \varphi \quad &\text {iff} \quad \exists t \in \mathbb {T}.\ \varPi [\pi \mapsto t], i \models _\mathbb {T} \varphi \\ \varPi , i \models _\mathbb {T} \forall \pi .\, \varphi \quad &\text {iff} \quad \forall t \in \mathbb {T}.\ \varPi [\pi \mapsto t], i \models _\mathbb {T} \varphi \end{aligned}$$

A transition system is a tuple \(\mathcal {T}= (S, S_0, \kappa , L)\) where S is a set of states, \(S_0 \subseteq S\) is a set of initial states, \(\kappa \subseteq S \times S\) is a transition relation, and \(L : S \rightarrow \varSigma \) is a labeling function. We write \(s \xrightarrow {\mathcal {T}} s'\) whenever \((s, s') \in \kappa \). A path is an infinite sequence \(s_0s_1s_2 \cdots \in S^\omega \), s.t., \(s_0 \in S_0\), and \(s_i \xrightarrow {\mathcal {T}} s_{i+1}\) for all i. The associated trace is given by \(L(s_0)L(s_1)L(s_2) \cdots \in \varSigma ^\omega \). We write \( Traces (\mathcal {T}) \subseteq \varSigma ^\omega \) for the set of all traces generated by \(\mathcal {T}\). We say \(\mathcal {T}\) satisfies a HyperLTL property \(\varphi \), written \(\mathcal {T}\models \varphi \), if \(\emptyset \models _{ Traces (\mathcal {T})} \varphi \), where \(\emptyset \) denotes the empty trace assignment.
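For intuition, an explicit-state transition system and its trace prefixes can be sketched as follows (a hypothetical helper for illustration, not AutoHyper's internal representation):

```python
class TransitionSystem:
    """Minimal explicit-state transition system (illustrative sketch):
    states are hashable, `succ` maps a state to its successors, and
    `label` maps a state to a set of atomic propositions."""

    def __init__(self, init, succ, label):
        self.init, self.succ, self.label = init, succ, label

    def trace_prefixes(self, k):
        """All length-k prefixes of Traces(T), by exploring paths."""
        prefixes = set()
        stack = [(s, (self.label[s],)) for s in self.init]
        while stack:
            s, trace = stack.pop()
            if len(trace) == k:
                prefixes.add(trace)
                continue
            for s2 in self.succ[s]:
                stack.append((s2, trace + (self.label[s2],)))
        return prefixes
```

Infinite traces are, of course, never enumerated explicitly; the automata constructions below represent them finitely.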

3 Automata-based HyperLTL Model Checking

Given a system \(\mathcal {T}\) and HyperLTL property \(\varphi \), we want to decide whether \(\mathcal {T}\models \varphi \). In this section, we recap the automata-based approach to the model checking of HyperLTL [29]. We further show how language inclusion checks can be incorporated into the model checking procedure to make use of a broad collection of mature language inclusion checkers.

3.1 Automata-based Verification

The idea of automata-based verification (ABV) [29] is to iteratively eliminate quantifiers and thus reduce MC to the emptiness check on an automaton. A non-deterministic Büchi automaton (NBA) is a tuple \(\mathcal {A}= (Q, Q_0, \delta , F)\) where Q is a finite set of states, \(Q_0 \subseteq Q\) is a set of initial states, \(\delta : Q \times \varSigma \rightarrow 2^Q\) is a transition function, and \(F \subseteq Q\) is a set of accepting states. We write \(\mathcal {L}(\mathcal {A}) \subseteq \varSigma ^\omega \) for the language of \(\mathcal {A}\), i.e., all infinite words on which \(\mathcal {A}\) has a run that visits states in F infinitely often (see, e.g., [2]). For traces \(t_1, \ldots , t_n \in \varSigma ^\omega \), we write \( zip (t_1, \ldots , t_n) \in (\varSigma ^n)^\omega \) for the pointwise product, i.e., \( zip (t_1, \ldots , t_n)(i) := (t_1(i), \ldots , t_n(i))\).
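The zip operation is exactly a pointwise product; on finite prefixes it coincides with Python's built-in zip (a trivial sketch):

```python
def zip_traces(*traces):
    """Pointwise product: zip(t1, ..., tn)(i) = (t1(i), ..., tn(i))."""
    return list(zip(*traces))

# two trace prefixes over AP = {a, b}
t1 = [{'a'}, set(), {'a'}]
t2 = [set(), {'b'}, {'b'}]
zipped = zip_traces(t1, t2)
```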

Let \(\mathcal {T}= (S, S_0, \kappa , L)\) be a fixed transition system and let \(\dot{\varphi }\) be some fixed closed HyperLTL formula (we use the dot to refer to the original formula and use \(\varphi , \varphi '\) to refer to subformulas of \(\dot{\varphi }\)). For some subformula \(\varphi \) that contains free trace variables \(\pi _1, \ldots , \pi _n\), we say an NBA \(\mathcal {A}\) over \(\varSigma ^n\) is \(\mathcal {T}\)-equivalent to \(\varphi \), if for all traces \(t_1, \ldots , t_n\) it holds that \([\pi _1 \mapsto t_1, \ldots , \pi _n \mapsto t_n] \models _{ Traces (\mathcal {T})} \varphi \) iff \( zip (t_1, \ldots , t_n) \in \mathcal {L}(\mathcal {A})\). That is, \(\mathcal {A}\) accepts exactly the zippings of traces that constitute a satisfying trace assignment for \(\varphi \).

To check if \(\mathcal {T}\models \dot{\varphi }\), we inductively construct an automaton \(\mathcal {A}_\varphi \) that is \(\mathcal {T}\)-equivalent to \(\varphi \) for each subformula \(\varphi \) of \(\dot{\varphi }\). For the (quantifier-free) LTL body of \(\dot{\varphi }\), we can construct this automaton via a standard LTL-to-NBA construction [2, 29]. Now consider some subformula \(\varphi ' = \exists \pi . \varphi \) where \(\varphi '\) contains free trace variables \(\pi _1, \ldots , \pi _n\) and so \(\varphi \) contains free trace variables \(\pi _1, \ldots , \pi _n, \pi \). We are given an inductively constructed NBA \(\mathcal {A}_{\varphi } = (Q, Q_0, \delta , F)\) over \(\varSigma ^{n+1}\) that is \(\mathcal {T}\)-equivalent to \(\varphi \). We define the automaton \(\mathcal {A}_{\varphi '} \) over \(\varSigma ^n\) as \(\mathcal {A}_{\varphi '} := (S \times Q, S_0 \times Q_0, \delta ', S \times F)\) where \(\delta '\) is defined as

$$\begin{aligned} \delta '\Big ((s, q), \big \langle l_1, \ldots , l_n\big \rangle \Big ) := \Big \{(s', q') \mid s \xrightarrow {\mathcal {T}} s' \;\wedge \; q' \in \delta \big (q, \big \langle l_1, \ldots , l_n, L(s)\big \rangle \big ) \Big \}. \end{aligned}$$

Informally, \(\mathcal {A}_{\varphi '}\) reads the zippings of traces \(t_1, \ldots , t_n\) and guesses a trace \(t \in Traces (\mathcal {T})\) such that \( zip (t_1, \ldots , t_n, t) \in \mathcal {L}(\mathcal {A}_{\varphi })\). It is easy to see that \(\mathcal {A}_{\varphi '}\) is \(\mathcal {T}\)-equivalent to \(\varphi '\). To handle universal trace quantification, we consider a formula \(\varphi ' = \forall \pi . \varphi \) as “\(\varphi ' = \lnot \exists \pi . \lnot \varphi \)” and combine the construction for existential quantification with an automaton complementation.
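The product construction for existential quantification can be sketched concretely. The encoding below is hypothetical (a system as `(S, S0, succ, label)` and an NBA as `(Q, Q0, delta, F)` with a transition function over letter tuples); it illustrates the construction of \(\mathcal {A}_{\varphi '}\), not AutoHyper's implementation:

```python
def exists_product(system, nba):
    """Sketch of A_{phi'} for phi' = exists pi. phi.
    `nba` reads letters (l1, ..., ln, l_{n+1}) over Sigma^{n+1}; the
    result reads letters (l1, ..., ln) and has states (s, q)."""
    S, S0, succ, label = system
    Q, Q0, delta, F = nba

    states = [(s, q) for s in S for q in Q]
    init = [(s, q) for s in S0 for q in Q0]
    acc = {(s, q) for s in S for q in F}

    def delta2(state, letter):
        s, q = state
        # guess a successor s' of s while the NBA reads letter + (L(s),)
        return {(s2, q2)
                for s2 in succ[s]
                for q2 in delta(q, letter + (label[s],))}

    return states, init, delta2, acc
```

For n = 0 (a closed formula \(\exists \pi . \varphi \)), the resulting automaton reads the empty letter \(\langle \rangle \) and simply guesses a trace of the system.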

Following the inductive construction, we obtain an automaton \(\mathcal {A}_{\dot{\varphi }}\) over the singleton alphabet \(\varSigma ^0\) that is \(\mathcal {T}\)-equivalent to \(\dot{\varphi }\). By definition of \(\mathcal {T}\)-equivalence, \(\mathcal {T}\models {\dot{\varphi }}\) iff \(\emptyset \models _{ Traces (\mathcal {T})} \dot{\varphi }\) iff \(\mathcal {A}_{\dot{\varphi }}\) is non-empty (which we can decide [21]).

3.2 HyperLTL Model Checking by Language Inclusion

The algorithm outlined above requires one complementation for each quantifier alternation in the HyperLTL formula. While we cannot avoid the theoretical cost of this complementation (see [15, 36]), we can reduce to a problem that is more tractable in practice: language inclusion.

For a system \(\mathcal {T}\) and a natural number \(n \in \mathbb {N}\), we define \(\mathcal {A}_\mathcal {T}^n\) as an NBA over \(\varSigma ^n\) such that for any traces \(t_1, \ldots , t_n \in \varSigma ^\omega \) we have \( zip (t_1, \ldots , t_n) \in \mathcal {L}(\mathcal {A}_\mathcal {T}^n)\) if and only if \(t_i \in Traces (\mathcal {T})\) for every \(1 \le i \le n\). We can construct \(\mathcal {A}_\mathcal {T}^n\) by building the n-fold self-composition of \(\mathcal {T}\) [3] and converting it to an automaton by moving the labels from states to edges and marking all states as accepting. We can now state a formal connection between language inclusion and HyperLTL MC (a proof can be found in the full version [9]):

Proposition 1

Let \(\dot{\varphi } = \forall \pi _1. \ldots \forall \pi _n. \varphi \) be a HyperLTL formula (where \(\varphi \) may contain additional trace quantifiers) and let \(\mathcal {A}_\varphi \) be an automaton over \(\varSigma ^n\) that is \(\mathcal {T}\)-equivalent to \(\varphi \). Then \(\mathcal {T}\models \dot{\varphi }\) if and only if \(\mathcal {L}(\mathcal {A}_\mathcal {T}^n) \subseteq \mathcal {L}(\mathcal {A}_\varphi )\).

We can use Proposition 1 to avoid a complementation for the outermost quantifier alternation. For example, assume \(\dot{\varphi } = \forall \pi _1. \forall \pi _2. \exists \pi _3. \psi \) where \(\psi \) is quantifier-free. Using the construction from Section 3.1, we obtain an automaton \(\mathcal {A}_{\exists \pi _3. \psi }\) that is \(\mathcal {T}\)-equivalent to \(\exists \pi _3. \psi \) (we can construct \(\mathcal {A}_{\exists \pi _3. \psi }\) in linear time in the size of \(\mathcal {T}\)). By Proposition 1, we then have \(\mathcal {T}\models \dot{\varphi }\) iff \(\mathcal {L}(\mathcal {A}^2_\mathcal {T}) \subseteq \mathcal {L}(\mathcal {A}_{\exists \pi _3. \psi })\).
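The self-composition automaton \(\mathcal {A}_\mathcal {T}^n\) can be sketched as follows (same hedging as before: a hypothetical encoding of a system as `(S, S0, succ, label)`; transitions read the labels of the current state tuple, matching the convention of \(\delta '\) above):

```python
from itertools import product

def self_composition(system, n):
    """Sketch of A_T^n: accepts zip(t1, ..., tn) iff every t_i is a
    trace of T. A transition leaving state (s1, ..., sn) reads the
    letter (L(s1), ..., L(sn)); all states are accepting."""
    S, S0, succ, label = system
    states = list(product(S, repeat=n))
    init = list(product(S0, repeat=n))

    def delta(state, letter):
        # reject letters that disagree with the labels of the state tuple
        if letter != tuple(label[s] for s in state):
            return set()
        # otherwise move each component to any of its successors
        return set(product(*(succ[s] for s in state)))

    return states, init, delta, set(states)
```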

Note that complementation followed by an emptiness check is a theoretically optimal method for solving the (PSPACE-complete) language inclusion problem. Proposition 1 thus offers no asymptotic advantage over “standard” ABV in Section 3.1. In practice, however, constructing an explicit complemented automaton is often unnecessary, as inclusion or non-inclusion can be witnessed without a complete complementation [17, 24,25,26]. This makes Proposition 1 relevant for the present work and for the performance of AutoHyper.

4 Related Work and HyperLTL Verification Approaches

HyperLTL [15] is the most studied logic for expressing hyperproperties. A range of problems from different areas in computer science can be expressed as HyperLTL MC problems, including (optimal) path planning [39], mutation testing [27], linearizability [31], robustness [22], information-flow control [36], and causality checking [18], to name only a few. Consequently, any model checking tool for HyperLTL is applicable to many disciplines within computer science and provides a unified solution to many challenging algorithmic problems. In recent years, different (mostly incomplete) methods for the verification of HyperLTL have been developed. We discuss them below (see the full version [9] for details).

Automata-based Model Checking. Finkbeiner et al. [29] introduce the automata-based model checking approach as presented in Section 3.1. For alternation-free formulas, the algorithm corresponds to the construction of the self-composition of a system [3] and is implemented in the MCHyper tool [29]. MCHyper can handle systems of significant size (well beyond the reach of explicit-state methods) but is unable to handle any quantifier alternation (the main motivation for AutoHyper). htltl2mc [15] is a prototype model checker for HyperLTL\(_2\) (a fragment of HyperLTL with at most one alternation) built on top of GOAL [38]. In contrast to htltl2mc, AutoHyper supports properties with arbitrarily many quantifier alternations and features automata with symbolic alphabets – which is important to handle large systems with many atomic propositions, cf. Footnote 7.

Strategy-based Verification. Coenen et al. [19] verify \(\forall ^*\exists ^*\) properties by instantiating existential quantification with an explicit strategy. This method – which we refer to as strategy-based verification (SBV) – comes in two flavors: either the strategy is provided by the user or the strategy is synthesized automatically. In the former case, model checking reduces to checking an alternation-free formula and can thus handle large systems, but requires significant user effort (and is thus no “push-button” technique). In the latter case, the method works fully automatically [7, 8] but requires an expensive strategy synthesis. SBV is incomplete as the strategy resolving existentially quantified traces only observes finite prefixes of the universally quantified traces. While SBV can be made complete by adding prophecy variables [7], the automatic synthesis of such prophecies is currently limited to very small systems and properties that are temporally safe [5]. We investigate both the performance and incompleteness of SBV in Section 8.

Bounded Model Checking. Hsu et al. [31] propose a bounded model checking (BMC) procedure for HyperLTL. Similar to BMC for trace properties [11], the system is unfolded up to a fixed depth, and pending obligations beyond that depth are either treated pessimistically (to show the satisfaction of a formula) or optimistically (to show the violation of a formula). While BMC for trace properties reduces to SAT-solving, BMC for hyperproperties naturally reduces to QBF-solving. As usual for bounded methods, BMC for HyperLTL is incomplete. For example, it can never show that a system satisfies a hyperproperty where the LTL body contains an invariant (as, e.g., is the case for GNI).Footnote 2 We compare AutoHyper and BMC (in the form of HyperQube [31]) in Section 7.

5 AutoHyper: Tool Overview

AutoHyper is written in F# and implements the automata-based verification approach described in Section 3.1 and, if desired by the user, makes use of the language-inclusion-based reduction from Section 3.2. AutoHyper uses spot [26] for LTL-to-NBA translations and automata complementations. To check language inclusion, AutoHyper uses spot (which is based on determinization), RABIT [17] (which is based on a Ramsey-based approach with heavy use of simulations), BAIT [25], and FORKLIFT [24] (both based on well-quasiorders). AutoHyper is designed such that communication with external automata tools is done via established text-based formats (as opposed to proprietary APIs), namely the HANOI [1] and BA automaton formats. New (or updated) tools that improve on fundamental automata operations, such as complementation and inclusion checks, can thus be integrated easily. Internally, we represent automata using symbolic alphabets (similar to spot). We store transition formulas as DNFs, as this allows for very efficient SAT checks, which are needed during the product construction.
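The benefit of the DNF representation can be illustrated with a small sketch (a hypothetical encoding, not AutoHyper's: a literal is a `(variable, polarity)` pair, a cube is a set of literals, and a DNF is a list of cubes):

```python
# A cube is satisfiable iff it contains no variable both positively and
# negatively, and a DNF is satisfiable iff one of its cubes is -- so SAT
# checks are linear scans rather than general SAT solving.

def cube_consistent(cube):
    return not any((v, not p) in cube for (v, p) in cube)

def dnf_sat(dnf):
    return any(cube_consistent(cube) for cube in dnf)

def dnf_and(d1, d2):
    """Conjunction of two DNFs, as needed in product constructions:
    the cross product of cubes, keeping only the consistent ones."""
    return [c1 | c2 for c1 in d1 for c2 in d2 if cube_consistent(c1 | c2)]
```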

All experiments in this paper were conducted on a Mac Mini with an Intel Core i3 (i3-8100B) and 16GB of memory. We used spot version 2.11.1; RABIT version 2.4.5; BAIT commit 369e1a4; and FORKLIFT commit 5d519e3.

Input Formats. AutoHyper supports both explicit-state systems (given in a HANOI-like [1] input format) and symbolic systems that are internally converted to an explicit-state representation. The support for symbolic systems includes Aiger circuits, symbolic models written in a fragment of the NuSMV input language [13], and a simple boolean programming language [6].

Random Benchmarks. For our evaluation, we use both existing instances from various sources in the literature and randomly generated problems.Footnote 3 We generate random transition systems based on the Erdős–Rényi–Gilbert model [28]. Given a size n and a density parameter \(p \in [0, 1]\), we generate a graph with n states, where for every two states \(s, s'\), there is a transition \(s \rightarrow s'\) with probability p. To generate a graph with n states and, in expectation, a constant outdegree of k, we choose \(p = \tfrac{k}{n}\). We further ensure that the system is connected and that all states have at least one outgoing edge. We generate random HyperLTL formulas (with a given quantifier prefix) by sampling the LTL matrix using spot’s randltl.
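The generation scheme can be sketched as follows (a simplified sketch: it patches states without successors but omits the connectivity check mentioned above; `random_system` is a hypothetical name):

```python
import random

def random_system(n, k, seed=None):
    """Erdos-Renyi-Gilbert-style generation: edge probability p = k/n
    gives an expected outdegree of k; every state is then patched to
    have at least one outgoing edge."""
    rng = random.Random(seed)
    p = k / n
    succ = {s: [t for t in range(n) if rng.random() < p] for s in range(n)}
    for s in range(n):
        if not succ[s]:  # ensure outdegree >= 1
            succ[s].append(rng.randrange(n))
    return succ

succ = random_system(100, 10, seed=0)
```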

6 HyperLTL Model Checking Complexity in Practice

Before we turn our attention to benchmarks found in the literature, we compare the different backend inclusion checkers supported by AutoHyper by evaluating them on a large set of synthetic (random) benchmarks (in Section 6.1). Moreover, the random generation of benchmarks allows us to peek at formulas with more than one quantifier alternation. The theoretical hardness of model checking properties with multiple alternations has been studied extensively [15, 36], and we analyze, for the first time, how these results transfer to practice (in Section 6.2).

6.1 Performance of Inclusion Checkers

As the first set of benchmarks, we compare the different backend inclusion checkers supported by AutoHyper. In Figure 1, we depict how many instances can be solved using the inclusion checks of spot, BAIT, RABIT, and FORKLIFT within a timeout of 10s, and give the median running time on the instances that could be solved within the timeout. We observe that spot clearly outperforms RABIT, BAIT, and FORKLIFT in terms of the percentage of instances that can be checked within 10s.Footnote 4 While, in general, spot solves the most instances, a manual inspection reveals that there are also instances that can only be solved by RABIT or BAIT/FORKLIFT. This justifies AutoHyper's support for multiple backend inclusion checkers, which implement different algorithms and thus excel on different problems (we will confirm this in Section 7). Moreover, our experiments provide evidence that HyperLTL MC is a natural source of challenging language inclusion benchmarks (see the full version [9]).

Fig. 1.

We evaluate different backend solvers on instances of varying system size with an (on average) constant outdegree of 10 and a fixed property size of 20. We generate 20 samples per system size. We display both the success rate of each solver within a timeout of 10s (on the left axis) and the median running time on the solved instances (on the right axis).

Fig. 2.

For properties with a varying number of quantifier alternations, we display the average time spent on each automaton complementation during model checking.

We remark that we set the timeout of 10s deliberately low to compute (and reproduce) the plots in a reasonable time (computing Figure 1 took about 3.5h). If a user wants to verify a given instance and does not require a result within a few seconds, running the solver for even longer will likely increase the success rate further (see also the evaluation in Section 7).

6.2 Model Checking Beyond \(\forall ^*\exists ^*\)

Using randomly generated benchmarks, we can also peek at the practical complexity of model checking in the presence of multiple quantifier alternations. In theory, the model checking complexity of HyperLTL increases by one exponent with each quantifier alternation [15, 36]. Using AutoHyper, we can, for the first time, investigate the model checking complexity in practice.

We model check randomly generated formulas with 1 to 4 quantifier alternations and visualize the total running time based on the cost of each complementation (using spot) in Figure 2 (recall that checking a formula with k alternations using ABV requires k automaton complementations). Although the number of quantifier alternations has an undeniable impact on the total running time (the cumulative height of each bar), the increase in runtime is not proportional to the (non-elementary) increase suggested by the theoretical analysis. Different from the theoretical analysis (where the \((k+1)\)th complementation is exponentially more expensive than the kth), the cost of each complementation barely increases (or even decreases). This suggests that the \(\mathcal {T}\)-equivalent automata constructed in each iteration are, in practice, much smaller than indicated by the worst-case theoretical analysis. Verification of properties beyond one alternation is thus less infeasible than the theory suggests (at least on randomly generated test cases).

7 Evaluation on Symbolic Systems

In this section, we challenge AutoHyper with complex model checking problems found in the literature. Our benchmarks stem from a range of sources, including non-interference in boolean programs [6], symmetry in mutual exclusion algorithms [19], non-interference in multi-threaded programs [37], fairness in non-repudiation protocols [32], mutation testing [27], and path planning [39].

Table 1. We depict the running time of AutoHyper when verifying GNI on the boolean programs taken from [6] and [10]. We give the program, the bitwidth (bw), the size of the intermediate explicit-state representation (Size), and the time taken by each solver. The timeout is set to 60s and indicated by a “-”. The property holds in all cases. Times are given in seconds.

7.1 Model Checking GNI on Boolean Programs

We use AutoHyper to verify GNI on a range of boolean programs that process high-security and low-security inputs (taken from [6, 10]). Table 1 depicts the runtime results using different backend solvers. We test each program with varying bitwidths and depict the largest bitwidth that can be solved by at least one solver (within a timeout of 60s). We again note that spot performs better than the other inclusion checkers and, in particular, scales better as the size of the system increases. Note that the number of atomic propositions is 3 in all instances, so spot’s support for symbolic alphabets has a negligible impact on the running time. We emphasize that not all instances in Table 1 can be verified using SBV [7, 19] without a user-provided fixed lookahead. Likewise, BMC [31] can never verify GNI. This provides further evidence of why complete model checking tools (of which AutoHyper is the first) are necessary.

Table 2. We evaluate HyperQube and AutoHyper on the benchmarks from [31]. We list the system and the property (as given in [31, Table 2]), the quantifier structure (\(\boldsymbol{Q}^*\)), the verification result (Res) (✓ indicates that the property holds and ✗ that it is violated), and the total running time of either tool (t). For HyperQube, we additionally list the unrolling bound (k) and the unrolling semantics (Sem). For AutoHyper, we additionally list the size of the intermediate explicit state space (Size). Times are given in seconds.

7.2 Explicit Model Checking of Symbolic Systems

In this section, we evaluate AutoHyper on challenging symbolic models (NuSMV models [13]) that were used by Hsu et al. [31] to evaluate HyperQube.

We verify a wide range of properties. For example, we verify that Lamport’s bakery algorithm [33] does not satisfy various symmetry properties (as the algorithm prioritizes processes with a lower ticket ID); we check linearizabilityFootnote 5 [30] on the SNARK data structure [23] and identify a previously known bug; and we generate model-based mutation test cases using the approach proposed by Fellner et al. [27]. Further details on the benchmarks are provided in [31].

We check each instance using both HyperQube and AutoHyper and depict the results in Table 2.Footnote 6 When using AutoHyper we always apply spot’s inclusion checker.Footnote 7 For HyperQube we use the unrolling semantics and unrolling depth listed in [31, Table 2]. We observe that for most instances – despite using explicit state methods and thus being complete (cf. Section 7.4) – AutoHyper performs on par with HyperQube. On instances using Lamport’s bakery algorithm, BMC only needs to unroll to very shallow depths, resulting in very efficient solving, whereas AutoHyper’s running time is dominated by spot’s LTL-to-NBA translation (consuming up to 98% of the total time). Conversely, on the large SNARK example, AutoHyper performs significantly better.

7.3 Hyperproperties for Path Planning

As a last set of benchmarks, we use planning problems for robots encoded into HyperLTL as proposed by Wang et al. [39]. For example, the synthesis of a shortest path can be phrased as an \(\exists \forall \) property that states that there exists a path to the goal such that all alternative paths to the goal take at least as long. Wang et al. [39] propose a solution to check the resulting HyperLTL property by encoding it in first-order logic, which is then solved by an SMT solver. While not competitive with state-of-the-art planning tools, HyperLTL allows one to express a broad range of problems (shortest path, path robustness, etc.) in a very general way. Hsu et al. [31] observe that the QBF encoding implemented in HyperQube outperforms the SMT-based approach by Wang et al. [39]. In this section, we evaluate AutoHyper on these planning hyperproperties and compare it with HyperQube.Footnote 8

We depict the results in Table 3. It is evident that AutoHyper outperforms HyperQube, sometimes by orders of magnitude. This is surprising as planning problems (which are essentially reachability problems) on symbolic systems should be advantageous for symbolic methods such as BMC. The large size of the intermediate QBF indicates that a more optimized encoding (perhaps specific to path planning) could improve the performance of BMC on such examples.

Table 3. We evaluate HyperQube and AutoHyper on hyperproperties that encode the existence of a shortest path (\(\varphi _ sp \)) and robust path (\(\varphi _ rp \)). We give the specification (Spec), the size of the grid (Grid), and the times taken by HyperQube and AutoHyper (t). For HyperQube, we additionally give the unrolling depth used (k) and the file size of the QBF generated (|QBF|). For AutoHyper, we additionally give the size of the generated explicit state space (Size). Times are given in seconds. The timeout is set to 20 min and indicated by a “-”.

7.4 Bounded vs. Explicit-State Model Checking

Bounded model checking has seen remarkable success in the verification of trace properties and frequently scales to systems whose size is well out of scope for explicit-state methods [20]. Similarly, in the context of alternation-free hyperproperties, symbolic verification tools such as MCHyper [29] (which internally reduces to the verification of a circuit using ABC [12]) can verify systems that are well beyond the reach of explicit-state methods. In contrast, in the context of model checking for hyperproperties that involve quantifier alternations, our findings make a strong case for the use of explicit-state methods (as implemented in AutoHyper):

First, in contrast to symbolic methods (such as BMC), explicit-state model checking is currently the only complete method. While BMC was able to verify or refute all properties in Tables 2 and 3, many instances cannot be solved with the current BMC encoding. As a concrete example, BMC can never verify formulas whose body contains a simple invariant (as is the case for GNI) and thus cannot verify any of the instances in Table 1. The most significant advantage of explicit-state MC (as implemented in AutoHyper) is thus that it is both push-button and complete, i.e., it can – at least in theory – verify or refute all properties.
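The incompleteness of plain BMC for invariants can be seen on a toy example: on a hypothetical two-state system whose invariant p actually holds, unrolling to any bound k only ever yields the inconclusive answer "no counterexample up to k". This is a sketch; real BMC encodes the unrolling as a SAT or QBF query rather than enumerating states.

```python
# Hypothetical two-state system in which the invariant "p" holds globally.
# Plain BMC searches for a violating prefix of length <= k; when the
# invariant holds, no bound can turn "no counterexample" into "verified".
TRANSITIONS = {0: [1], 1: [0]}
P = {0: True, 1: True}  # the invariant holds in every state

def bmc_invariant(init, bound):
    """Return 'cex' if some execution of length <= bound violates p,
    else the inconclusive answer 'no cex up to bound'."""
    frontier = [init]
    for _ in range(bound + 1):
        if any(not P[s] for s in frontier):
            return "cex"
        frontier = [t for s in frontier for t in TRANSITIONS[s]]
    return "no cex up to bound"

print(bmc_invariant(0, 10))  # inconclusive, for every choice of bound
```

Techniques such as induction or lasso-based completeness checks close this gap for trace properties; the current hyperproperty BMC encoding does not.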

Second, the performance of AutoHyper seems to be on par with that of BMC and frequently outperforms it (even by several orders of magnitude, cf. Table 3). We stress that this is despite the fact that for the evaluation of HyperQube we already fix an unrolling depth and unrolling semantics, thus creating favorable conditions for HyperQube.Footnote 9 While BMC for trace properties reduces to SAT solving, BMC of hyperproperties reduces to QBF solving, a problem that is much harder and has seen less support from industry-strength tools. It is, therefore, unclear whether the advance of modern QBF solvers can improve the performance of hyperproperty BMC to the same degree that the advance of SAT solvers has stimulated the success of BMC for trace properties. Our findings indicate that, at the moment, QBF solving is often inferior to an explicit (automata-based) solving strategy.

8 Evaluating Strategy-based Verification

So far, we have used AutoHyper to check hyperproperties on instances arising in the literature. In this last section, we demonstrate that AutoHyper also serves as a valuable baseline to evaluate different (possibly incomplete) verification methods. Here we focus on strategy-based verification (SBV), i.e., the idea of automatically synthesizing a strategy that resolves existential quantification in \(\forall ^*\exists ^*\) HyperLTL properties [7, 19].

8.1 Effectiveness of Strategy-based Verification

Fig. 3. We compare ABV (AutoHyper) and SBV ([7]) on instances of varying system size. We fix the property size to 20. We generate 100 random instances for each size and take the average over the fastest L instances, where L is the minimum number of instances solved within a 5 s timeout by both methods.

SBV is known to be incomplete [7, 19]. However, due to the previous lack of complete tools for verifying \(\forall ^*\exists ^*\) properties, a detailed study of how effective SBV is in practice was impossible on a larger scale (i.e., beyond hand-crafted examples). With AutoHyper, we can, for the first time, rigorously evaluate SBV. We use the SBV implementation from [7], which synthesizes a strategy for the \(\exists \)-player by translating the formula to a deterministic parity automaton (DPA) [35] and phrases the synthesis as a parity game.

We have generated random transition systems and properties of varying sizes and computed a ground truth using AutoHyper. We then performed SBV (recall that SBV can never show that a property does not hold and might fail to establish that it does). We find that for our generated instances, the property holds in 61.1% of the cases, and SBV can verify the property in 60.4% of the cases. Successful verification with SBV is thus possible in many cases, even without the addition of expensive mechanisms such as prophecies [7]. On the other hand, our results show that random generation produces instances (albeit not many) on which SBV fails (so far, examples where SBV fails required careful construction by hand). Resorting to SBV as the default verification strategy is thus not possible, which further strengthens the case for complete model checking tools (of which AutoHyper is the first).
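The random-instance generation described above can be sketched as follows. The generator and all its parameters (state count, transition density, number of atomic propositions) are illustrative, not the ones used in the evaluation.

```python
import random

def random_system(n_states, density, n_aps, rng):
    """Sketch of a random transition-system generator: every state gets a
    fixed number of successors (so there are no dead ends) and a random
    labeling over the atomic propositions."""
    states = list(range(n_states))
    trans = {s: rng.sample(states, max(1, int(density * n_states)))
             for s in states}
    labels = {s: frozenset(a for a in range(n_aps) if rng.random() < 0.5)
              for s in states}
    return trans, labels

rng = random.Random(0)  # seeded for reproducibility
trans, labels = random_system(n_states=10, density=0.3, n_aps=2, rng=rng)
assert all(len(succ) >= 1 for succ in trans.values())  # no dead ends
print(len(trans))  # 10
```

A complete checker such as AutoHyper can then label each generated instance with a ground truth, against which the (possibly inconclusive) SBV verdicts are compared.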

8.2 Efficiency of Strategy-based Verification

After having analyzed the effectiveness of SBV (i.e., how many instances can be verified), we turn our attention to the efficiency of SBV. In theory, (automata-based) model checking of \(\forall ^*\exists ^*\) HyperLTL – as implemented in AutoHyper – is EXPSPACE-complete in the specification and PSPACE-complete in the size of the system [15, 36]. Conversely, SBV is 2-EXPTIME-complete in the size of the specification but only PTIME in the size of the system [19]. Consequently, one would expect that ABV fares better on larger specifications and SBV fares better on larger systems (the more important measure in practice).

However, in this section, we show that this does not translate into practice (at least using the current implementation of SBV [7]). We compare the running time of AutoHyper (ABV) (using spot’s inclusion checker) and SBV. We break the running time into the three main steps for each method. For ABV, this is the LTL-to-NBA translation, the construction of the product automaton, and the inclusion check. For SBV, it is the LTL-to-DPA translation, the construction of the game, and the game-solving.

We depict the average cost for varying system sizes in Figure 3. We observe that SBV performs worse than ABV and, more importantly, scales poorly in the size of the system. This is contrary to the theoretical analysis of ABV and SBV. As the detailed breakdown of the running time suggests, the poor performance is due to the costly construction of the game and the time taken to solve it. An almost identical picture emerges if we compare ABV and SBV relative to the property size (we give a plot in the full version [9]). While, in this case, the results match the theory (i.e., SBV scales worse in the size of the specification), we find that the bottleneck for SBV is not the LTL-to-DPA translation (which, in theory, is exponentially more expensive than the LTL-to-NBA translation used in ABV) but, again, the construction and solving of the parity game.

We remark that the SBV engine we used [7] is not optimized and always constructs the full (reachable) game graph. The poor performance of SBV can be attributed to the fact that the size of the game does, in the worst case, scale quadratically in the size of the system (when considering \(\forall ^1\exists ^1\) properties). This is amplified in dense systems (i.e., systems with many transitions), as, with increasing transition density, the size of the parity games approaches its worst-case size (see the full version [9]). In contrast, the heavily optimized inclusion checker (in this case spot) seems to be able to check inclusion in almost constant time (despite being exponential in theory). This efficiency of mature language inclusion checkers is what enables AutoHyper to achieve remarkable performance that often exceeds that of symbolic methods such as BMC (cf. Section 7) and further strengthens the practical impact of Proposition 1.
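The quadratic blow-up can be made concrete: for a \(\forall ^1\exists ^1\) property, the game arena is built over pairs of system states (one per trace copy), so a dense system yields close to \(|S|^2\) positions. The following is a small sketch over a hypothetical dense system; the real construction additionally tracks automaton states and ownership of positions.

```python
from collections import deque

def product_arena(trans, init):
    """Reachable positions of the two-copy game arena: pairs (s, t) where
    the forall-copy is in state s and the exists-copy is in state t."""
    seen, queue = {(init, init)}, deque([(init, init)])
    while queue:
        s, t = queue.popleft()
        for s2 in trans[s]:
            for t2 in trans[t]:
                if (s2, t2) not in seen:
                    seen.add((s2, t2))
                    queue.append((s2, t2))
    return seen

# In a dense system (every state can move to every state), the arena is the
# full square of the state space:
n = 8
dense = {s: list(range(n)) for s in range(n)}
print(len(product_arena(dense, 0)))  # 64 = n^2
```

With sparser transition relations, far fewer pairs are reachable, which matches the observation that the game size approaches its worst case as transition density increases.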

9 Conclusion

In this paper, we have presented AutoHyper, the first complete model checker for HyperLTL with an arbitrary quantifier prefix. We have demonstrated that AutoHyper can check many interesting properties involving quantifier alternations and often outperforms symbolic methods such as BMC, sometimes by orders of magnitude. We believe the biggest advantage of AutoHyper to be its push-button functionality combined with its completeness: As a user, one does not need to worry whether AutoHyper is applicable to a particular property (different from, e.g., SBV or BMC) and does not need to provide hints (e.g., in the form of explicit strategies in SBV).

Apart from evaluating AutoHyper’s performance on a range of benchmarks, we have used AutoHyper to (1) compare various backend language inclusion checkers, (2) explore practical verification beyond one quantifier alternation (which is not as infeasible as the theory suggests), and (3) rigorously evaluate the effectiveness and efficiency of strategy-based verification in practice (which, contrary to what the theoretical analysis suggests, performs worse than automata-based methods).