1 Introduction

The framework of Constrained Horn Clauses (CHC) has been proposed as a unified, purely logic-based, intermediate format for software verification tasks [33]. CHC provides a powerful way to model various verification problems, such as safety, termination, and loop invariant computation, across different domains like transition systems, functional programs, procedural programs, concurrent systems, and more [33, 34, 35, 41]. The key advantage of CHC is the separation of modelling from solving, which aligns with an important software design principle: separation of concerns. This makes CHCs highly reusable, allowing a specialized CHC solver to be used for different verification tasks across domains and programming languages. The main focus of the front end is then to translate the source code into the language of constraints, while the back end can focus solely on the well-defined formal problem of deciding satisfiability of a CHC system.

CHC-based verification is becoming increasingly popular, with several frameworks developed in recent years, including SeaHorn, Korn and TriCera for C [27, 28, 36], JayHorn for Java [44], RustHorn for Rust [48], HornDroid for Android [18], and SolCMC and SmartACE for Solidity [2, 57]. A novel CHC-based approach for testing also shows promising results [58]. The growing demand from verifiers drives the development of specialized Horn solvers. Different solvers implement different techniques based on, e.g., model-checking approaches (such as predicate abstraction [32], CEGAR [22] and IC3/PDR [16, 26]), machine learning, automata, or CHC transformations. Eldarica [40] uses predicate abstraction and CEGAR as the core solving algorithm. It leverages Craig interpolation [23] not only to guide the predicate abstraction but also for acceleration [39]. Additionally, it controls the form of the interpolants with interpolation abstraction [46, 53]. Spacer [45] is the default algorithm for solving CHCs in Z3 [51]. It extends the PDR-style algorithm for nonlinear CHCs [38] with under-approximations and leverages model-based projection for predecessor computation. Recently, it was enriched with global guidance [37]. Ultimate TreeAutomizer [25] implements automata-based approaches to CHC solving [43, 56]. HoIce [20] implements a machine-learning-based technique adapted from the ICE framework developed for discovering inductive invariants of transition systems [19]. FreqHorn [29, 30] combines syntax-guided synthesis [4] with data derived from unrollings of the CHC system.

According to the results of the international competition on CHC solving, CHC-COMP [24, 31, 54], solvers applying model-checking techniques, namely Spacer and Eldarica, regularly outperform the competitors. These are the solvers most often used as the back ends in CHC-based verification projects. However, only specific algorithms have been explored in these tools for CHC solving, which limits their application to diverse verification tasks. Experience from software verification and model checking of transition systems shows that, in contrast to the state of affairs in CHC solving, it is possible to build a flexible infrastructure with a unified environment for multiple back-end solving algorithms. CPAchecker [6, 7, 8, 9, 10, 11] and Pono [47] are examples of such tools.

This work aims to bring this flexibility to the general domain-independent framework of constrained Horn clauses. We present Golem, a new solver for CHC satisfiability that provides a unique combination of flexibility and efficiency. Golem implements several SMT-based model-checking algorithms: our recent model-checking algorithm based on Transition Power Abstraction (TPA) [13, 14], and the state-of-the-art model-checking algorithms Bounded Model Checking (BMC) [12], k-induction [55], Interpolation-based Model Checking (IMC) [49], Lazy Abstractions with Interpolants (LAWI) [50] and Spacer [45]. Golem achieves efficiency through tight integration with the underlying interpolating SMT solver OpenSMT [17, 42] and preprocessing transformations based on predicate elimination, clause merging and redundant clause elimination. The flexible and modular framework of OpenSMT enables customization for different algorithms; in particular, its powerful interpolation modules offer fine control over interpolants (in size and strength) through multiple interpolant generation procedures. We report experimentation that confirms the advantage of multiple diverse solving techniques and shows that Golem is competitive with state-of-the-art Horn solvers on large sets of problems. Overall, Golem can serve as an efficient back end for domain-specific verification tools and as a research tool for prototyping and evaluating SMT- and interpolation-based verification techniques in a unified setting.

2 Tool Overview

In this section, we describe the main components and features of the tool together with the details of its usage. For completeness, we recall the terminology related to CHCs first.

Constrained Horn Clauses. A constrained Horn clause is a formula \(\varphi \wedge B_1 \wedge B_2 \wedge \ldots \wedge B_n \implies H\), where \(\varphi \) is the constraint, a formula in the background theory, \(B_1, \ldots , B_n\) are uninterpreted predicates, and \(H\) is an uninterpreted predicate or \( false \). The antecedent of the implication is commonly denoted as the body and the consequent as the head. A clause with more than one predicate in the body is called nonlinear. A nonlinear system of CHCs has at least one nonlinear clause; otherwise, the system is linear.
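The definitions above can be made concrete with a small data structure. The following is a minimal sketch, not Golem's internal representation: the class names `Atom` and `Clause`, the string encoding of constraints, and the example predicates `Inv`, `P`, `Q`, `R` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Atom:
    """An uninterpreted predicate applied to argument terms (hypothetical encoding)."""
    predicate: str
    args: Tuple[str, ...]


@dataclass(frozen=True)
class Clause:
    """constraint and body[0] and ... and body[n-1] implies head.

    A head of None stands for the head `false`.
    """
    constraint: str
    body: Tuple[Atom, ...]
    head: Optional[Atom]

    def is_linear(self) -> bool:
        # A clause is nonlinear when its body has more than one predicate.
        return len(self.body) <= 1


def is_linear_system(clauses: Sequence[Clause]) -> bool:
    # A system is nonlinear as soon as one of its clauses is nonlinear.
    return all(clause.is_linear() for clause in clauses)


# Hypothetical example clauses, with constraints kept as plain strings.
fact = Clause("x = 0", (), Atom("Inv", ("x",)))
step = Clause("x' = x + 1", (Atom("Inv", ("x",)),), Atom("Inv", ("x'",)))
query = Clause("x < 0", (Atom("Inv", ("x",)),), None)
join = Clause("z = x + y", (Atom("P", ("x",)), Atom("Q", ("y",))), Atom("R", ("z",)))

print(is_linear_system([fact, step, query]))        # True
print(is_linear_system([fact, step, query, join]))  # False
```

The first three clauses have the shape of a single-loop safety problem; adding `join`, with two predicates in its body, makes the system nonlinear.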

Fig. 1. High-level architecture of Golem

Architecture. The flow of data inside Golem is depicted in Fig. 1. The system of CHCs is read from an .smt2 file, a script in an extension of the language of SMT-LIB. The Interpreter interprets the SMT-LIB script and builds the internal representation of the system of CHCs. In Golem, CHCs are first normalized, then the system is translated into an internal graph representation. Normalization rewrites clauses to ensure that each predicate has only variables as arguments. The graph representation of the system is then passed to the Preprocessor, which applies various transformations to simplify the input graph. The Preprocessor then hands the transformed graph to the chosen back-end engine. Engines in Golem implement various SMT-based model-checking algorithms for solving the CHC satisfiability problem. There are currently six engines in Golem: TPA, BMC, KIND, IMC, LAWI, and Spacer (see details in Sect. 3). The user selects the engine to run with the command-line option --engine. Golem relies on the interpolating SMT solver OpenSMT [42] not only for answering SMT queries but also for the interpolant computation required by most of the engines. Interpolating procedures in OpenSMT can be customized on demand for the specific needs of each engine [1]. Additionally, Golem re-uses the data structures of OpenSMT for representing and manipulating terms.
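The normalization step can be illustrated with a short sketch. It is one plausible way to ensure that predicates take only variables as arguments, not a description of Golem's actual procedure: the function `normalize`, the fresh-variable prefix `_n`, and the string encoding of terms in SMT-LIB syntax are assumptions, and the sketch further assumes that no input variable already uses the `_n` prefix.

```python
import re
from itertools import count

# A term counts as a variable if it is a plain identifier (primes allowed).
VARIABLE = re.compile(r"[A-Za-z_][A-Za-z0-9_']*")


def normalize(constraints, atoms):
    """Replace every non-variable predicate argument by a fresh variable.

    constraints: list of constraint strings in SMT-LIB syntax
    atoms: list of (predicate name, list of argument terms)
    Returns the extended constraint list and the rewritten atoms.
    """
    fresh = count()
    new_constraints = list(constraints)
    new_atoms = []
    for predicate, args in atoms:
        new_args = []
        for term in args:
            if VARIABLE.fullmatch(term):
                new_args.append(term)
            else:
                # Introduce a fresh variable and move the term into the constraint.
                variable = f"_n{next(fresh)}"
                new_constraints.append(f"(= {variable} {term})")
                new_args.append(variable)
        new_atoms.append((predicate, new_args))
    return new_constraints, new_atoms


constraints, atoms = normalize(["(> x 0)"], [("Inv", ["x"]), ("Inv", ["(+ x 1)"])])
print(constraints)  # ['(> x 0)', '(= _n0 (+ x 1))']
print(atoms)        # [('Inv', ['x']), ('Inv', ['_n0'])]
```

The compound argument `(+ x 1)` moves into the constraint as an equality, so the predicate application itself mentions only variables.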

Models and Proofs. Besides solving the CHC satisfiability problem, a witness for the answer is often required by the domain-specific application. A satisfiability witness is a model, an interpretation of the CHC predicates that makes all clauses valid. An unsatisfiability witness is a proof, a derivation of the empty clause from the input clauses. In software verification, these witnesses correspond to program invariants and counterexample paths, respectively. All engines in Golem produce witnesses for their answers. Witnesses from engines are translated back through the applied preprocessing transformations. Only after this back-translation does the witness match the original input system, and only then is it reported to the user. Witnesses must be explicitly requested with the option --print-witness.

Models are internally stored as formulas in the background theory, using only the variables of the (normalized) uninterpreted predicates. They are presented to the user in the format defined by SMT-LIB [5]: a sequence of SMT-LIB’s define-fun commands, one for each uninterpreted predicate.
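As an illustration of this output format, the following sketch renders a model as a sequence of define-fun commands. The function `model_to_smtlib`, the dictionary layout of the model, and the example predicate `Inv` are hypothetical; the exact layout of Golem's output (line breaks, enclosing parentheses) is not specified here and may differ.

```python
def model_to_smtlib(model):
    """Render one define-fun command per uninterpreted predicate.

    model: dict mapping a predicate name to a pair
           (list of (variable, sort) parameters, body formula in SMT-LIB syntax)
    """
    lines = []
    for name, (parameters, body) in model.items():
        signature = " ".join(f"({variable} {sort})" for variable, sort in parameters)
        lines.append(f"(define-fun {name} ({signature}) Bool {body})")
    return "\n".join(lines)


# Hypothetical model: the invariant x >= 0 for a predicate Inv over integers.
example_model = {"Inv": ([("x", "Int")], "(>= x 0)")}
print(model_to_smtlib(example_model))
# (define-fun Inv ((x Int)) Bool (>= x 0))
```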

For the proofs, Golem follows the trace format proposed by Eldarica. Internally, proofs are stored as a sequence of derivation steps. Every derivation step represents a ground instance of some clause from the system. The ground instances of predicates from the body form the premises of the step, and the ground instance of the head’s predicate forms the conclusion of the step. For the derivation to be valid, the premises of each step must have been derived earlier, i.e., each premise must be a conclusion of some derivation step earlier in the sequence. To the user, the proof is presented as a sequence of derivations of ground instances of the predicates, where each step is annotated with the indices of its premises. See Example 1 below for the illustration of the proof trace.

Golem also implements an internal validator that checks the correctness of the witnesses. It validates a model by substituting the interpretations for the predicates and checking the validity of all the clauses with OpenSMT. Proofs are validated by checking all conditions listed above for each derivation step. Validation is enabled with the option --validate and serves primarily as a debugging tool for the developers of witness production.

Example 1

Consider the following CHC system and the proof of its unsatisfiability.

[Figure: CHC system and its proof of unsatisfiability]

The derivation of \( false \) consists of four derivation steps. Step 1 instantiates the first clause for \(x:= 1\). Step 2 instantiates the second clause for \(x:= 1\) and \(x^{\prime }:= 2\). Step 3 applies resolution to the instance of the third clause for \(x:= 1\) and \(x^{\prime }:= 2\) and facts derived in steps 1 and 2. Finally, step 4 applies resolution to the instance of the fourth clause for \(x:= 2\) and the fact derived in step 3.
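The conditions on a valid derivation can be checked mechanically. The sketch below validates a four-step trace of the shape described above. The clause system in `CLAUSES` is an assumption, chosen only to be consistent with the instantiations named in the text; it is not taken from the figure. The encoding of constraints as Python functions and the name `validate` are likewise hypothetical, and step indices are 0-based here, whereas the text counts from 1.

```python
# Hypothetical system, written as (constraint, body atoms, head atom).
# Atoms are (predicate, variable names); a head of None stands for false.
CLAUSES = [
    (lambda s: s["x"] > 0, [], ("L1", ("x",))),
    (lambda s: s["xp"] == s["x"] + 1, [], ("D", ("x", "xp"))),
    (lambda s: True, [("L1", ("x",)), ("D", ("x", "xp"))], ("L2", ("xp",))),
    (lambda s: s["x"] <= 2, [("L2", ("x",))], None),
]


def ground(atom, substitution):
    """Instantiate an atom with concrete values; None becomes 'false'."""
    if atom is None:
        return "false"
    predicate, variables = atom
    return (predicate, tuple(substitution[v] for v in variables))


def validate(proof):
    """proof: list of (clause index, substitution, premise step indices)."""
    derived = []
    for clause_index, substitution, premise_indices in proof:
        constraint, body, head = CLAUSES[clause_index]
        # The ground instance must satisfy the clause's constraint.
        if not constraint(substitution):
            return False
        premises = [ground(atom, substitution) for atom in body]
        if len(premise_indices) != len(premises):
            return False
        # Every premise must be the conclusion of an earlier step.
        for index, fact in zip(premise_indices, premises):
            if not (0 <= index < len(derived)) or derived[index] != fact:
                return False
        derived.append(ground(head, substitution))
    # A proof of unsatisfiability must end by deriving false.
    return bool(derived) and derived[-1] == "false"


PROOF = [
    (0, {"x": 1}, []),                 # derives L1(1)
    (1, {"x": 1, "xp": 2}, []),        # derives D(1, 2)
    (2, {"x": 1, "xp": 2}, [0, 1]),    # derives L2(2) from steps 0 and 1
    (3, {"x": 2}, [2]),                # derives false from step 2
]
print(validate(PROOF))  # True
```

Reordering the steps so that a premise is used before it is derived makes `validate` return `False`, which mirrors the requirement that each premise be a conclusion of an earlier step.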

Preprocessing Transformations. Preprocessing can significantly improve performance by transforming the input CHC system into one more suitable for the back-end engine. The most important transformation in Golem is predicate elimination. Given a predicate not present in both the body and the head of the same clause, the predicate can be eliminated by exhaustive application of the resolution rule. This transformation is most beneficial when it also decreases the number of clauses. Clause merging is a transformation that merges all clauses with the same uninterpreted predicates in the body and the head to a single clause by disjoining their constraints. This effectively pushes work from the level of the model-checking algorithm to the level of the SMT solver. Additionally, Golem detects and deletes redundant clauses, i.e., clauses that cannot participate in the proof of unsatisfiability.
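Clause merging, as described above, can be sketched in a few lines. This is an illustration under simplifying assumptions, not Golem's implementation: clauses are triples of strings, atoms such as `Inv(x)` are compared textually (which presumes the system is already normalized so that equal atoms are spelled identically), and the function name `merge_clauses` is hypothetical.

```python
def merge_clauses(clauses):
    """Merge clauses that share body and head by disjoining their constraints.

    clauses: list of (constraint, tuple of body atoms, head atom), all strings.
    """
    groups = {}
    for constraint, body, head in clauses:
        # Body atoms are sorted so that their order does not matter.
        key = (tuple(sorted(body)), head)
        groups.setdefault(key, []).append(constraint)
    merged = []
    for (body, head), constraints in groups.items():
        if len(constraints) == 1:
            constraint = constraints[0]
        else:
            constraint = "(or " + " ".join(constraints) + ")"
        merged.append((constraint, body, head))
    return merged


# Hypothetical system: one fact and two transitions between the same predicates.
system = [
    ("(= x 0)", (), "Inv(x)"),
    ("(= x' (+ x 1))", ("Inv(x)",), "Inv(x')"),
    ("(= x' (+ x 2))", ("Inv(x)",), "Inv(x')"),
]
for clause in merge_clauses(system):
    print(clause)
```

The two transition clauses collapse into one clause whose constraint is the disjunction of the original constraints, so the case split is left to the SMT solver rather than to the model-checking algorithm.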

An important feature of Golem is that all applied transformations are reversible in the sense that any model or proof for the transformed system can be translated back to a model or proof of the original system.

3 Back-end Engines of Golem

The core components of Golem that solve the problem of satisfiability of a CHC system are referred to as back-end engines, or just engines. Golem implements several popular state-of-the-art algorithms from model checking and software verification: BMC, k-induction, IMC, LAWI and Spacer. These algorithms treat the problem of solving a CHC system as a reachability problem in the graph representation.

The unique feature of Golem is the implementation of the new model-checking algorithm based on the concept of Transition Power Abstraction (TPA). It is capable of much deeper analysis than other algorithms when searching for counterexamples [14], and it discovers transition invariants [13], as opposed to the usual (state) invariants.

3.1 Transition Power Abstraction

The TPA engine in Golem implements the model-checking algorithm based on the concept of Transition Power Abstraction. It can work in two modes: The first mode implements the basic TPA algorithm, which uses a single TPA sequence [14]. The second mode implements the more advanced version, split-TPA, which relies on two TPA sequences obtained by splitting the single TPA sequence of the basic version [13]. In Golem, both variants use the under-approximating model-based projection for propagating truly reachable states, avoiding full quantifier elimination. Moreover, they benefit from incremental solving available in OpenSMT, which speeds up the satisfiability queries.

The TPA algorithms, as described in the publications, operate on transition systems [13, 14]. However, the engine in Golem is not limited to a single transition system. It can analyze a connected chain of transition systems. In the software domain, this model represents programs with a sequence of consecutive loops. The extension to the chain of transition systems works by maintaining a separate TPA sequence for each node on the chain, where each node has its own transition relation. The reachable states are propagated forwards on the chain, while safe states—from which final error states are unreachable—are propagated backwards. In this scenario, transition systems on the chain are queried for reachability between various initial and error states. Since the transition relations remain the same, the summarized information stored in the TPA sequences can be re-used across multiple reachability queries. The learnt information summarizing multiple steps of the transition relation is not invalidated when the initial or error states change.

Golem's TPA engine discovers counterexample paths in unsafe transition systems, which readily translate to unsatisfiability proofs for the corresponding CHC systems. For safe transition systems, it discovers safe k-inductive transition invariants. If a model for the corresponding CHC system is required, the engine first computes a quantified inductive invariant and then applies quantifier elimination to produce a quantifier-free inductive invariant, which is output as the corresponding model.

The TPA engine’s ability to discover deep counterexamples and transition invariants gives Golem a unique edge for systems requiring deep exploration. We provide an example of this capability as part of the evaluation in Sect. 4.

3.2 Engines for State-of-the-Art Model-Checking Algorithms

Besides TPA, Golem implements several popular state-of-the-art model-checking algorithms. Among them are bounded model checking [12], k-induction [55] and McMillan’s interpolation-based model checking [49], which operate on transition systems. Golem faithfully follows the description of the algorithms in the respective publications.
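To make the relation between BMC and k-induction concrete, the following sketch implements both over a small finite-state transition system by explicit enumeration. This is only an illustration of the two algorithms: the engines in Golem pose the corresponding checks as SMT queries rather than enumerating states, and the functions `bmc` and `k_induction` and the example system are hypothetical.

```python
def bmc(init, trans, bad, states, bound):
    """Return the length of a shortest counterexample up to `bound`, else None."""
    frontier = {s for s in states if init(s)}
    for depth in range(bound + 1):
        if any(bad(s) for s in frontier):
            return depth
        frontier = {t for s in frontier for t in states if trans(s, t)}
    return None


def k_induction(init, trans, bad, states, max_k):
    """Try k = 1, ..., max_k; return 'unsafe', 'safe' or 'unknown'."""
    for k in range(1, max_k + 1):
        # Base case: no bad state is reachable in fewer than k steps.
        if bmc(init, trans, bad, states, k - 1) is not None:
            return "unsafe"
        # Step case: no path of k good states has a bad successor.
        paths = [[s] for s in states if not bad(s)]
        for _ in range(k - 1):
            paths = [p + [t] for p in paths for t in states
                     if not bad(t) and trans(p[-1], t)]
        if not any(trans(p[-1], t) and bad(t) for p in paths for t in states):
            return "safe"
    return "unknown"


# Hypothetical system: a counter modulo 8 that starts at 0 and adds 2.
STATES = range(8)
def init(s): return s == 0
def trans(s, t): return t == (s + 2) % 8

print(k_induction(init, trans, lambda s: s == 5, STATES, 4))  # safe
print(k_induction(init, trans, lambda s: s == 5, STATES, 3))  # unknown
print(k_induction(init, trans, lambda s: s == 6, STATES, 4))  # unsafe
```

In this example, state 5 is unreachable, but the property is only 4-inductive, because the chain of good states 7, 1, 3 leads into 5. State 6, in contrast, is reached after three steps, so the base case of k = 4 reports the counterexample.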

Additionally, Golem implements Lazy Abstractions with Interpolants (LAWI), an algorithm introduced by McMillan for verification of software [50]. In the original description, the algorithm operates on programs represented with abstract reachability graphs, which map straightforwardly to linear CHC systems. This is the input supported by our implementation of the algorithm in Golem.

The last engine in Golem implements the IC3-based algorithm Spacer [45] for solving general, even nonlinear, CHC systems. Nonlinear CHC systems can model programs with summaries, and in this setting, Spacer computes both under-approximating and over-approximating summaries of the procedures to achieve modular analysis of programs. Spacer is currently the only engine in Golem capable of solving nonlinear CHC systems.

All engines in Golem rely on OpenSMT for answering SMT queries, often leveraging the incremental capabilities of OpenSMT to implement the corresponding model-checking algorithm efficiently. Additionally, the engines IMC, LAWI, Spacer and TPA heavily use the flexible and controllable interpolation framework in OpenSMT [1, 52], especially multiple interpolation procedures for linear-arithmetic conflicts [3, 15].

4 Experiments

In this section, we evaluate the performance of Golem's individual engines on the benchmarks from the latest edition of CHC-COMP. The goal of these experiments is to 1) demonstrate the usefulness of multiple back-end engines and their potential combined use for solving various problems, and 2) compare Golem against state-of-the-art Horn solvers.

The benchmark collections of CHC-COMP represent a rich source of problems from various domains. Version 0.3.2 of Golem was used for these experiments. Z3-Spacer (Z3 4.11.2) and Eldarica 2.0.8 were run (with default options) for comparison as the best Horn solvers available. All experiments were conducted on a machine with an AMD EPYC 7452 32-core processor and 8\(\,\times \,\)32 GiB of memory; the timeout was set to 300 s. No conflicting answers were observed in any of the experiments. The results are in line with the results of the last editions of CHC-COMP where Golem participated [24, 31]. Our artifact for reproducing the experiments is available at https://doi.org/10.5281/zenodo.7973428.

4.1 Category LRA-TS

We ran all engines of Golem on all 498 benchmarks from the LRA-TS (transition systems over linear real arithmetic) category of CHC-COMP.

Table 1. Number of solved benchmarks from LRA-TS category.

Table 1 shows the number of benchmarks solved per engine, together with a virtual best (VB) engine. On unsatisfiable problems, the differences between the engines' performance are not substantial, but the BMC engine firmly dominates the others. On satisfiable problems, we see significant differences. Figure 2 plots, for each engine, the number of solved satisfiable benchmarks (x-axis) within the given time limit (y-axis, log scale).

Fig. 2. Performance of Golem's engines on SAT problems of LRA-TS category.

The large lead of VB suggests that the solving abilities of the engines are widely complementary. No single engine dominates the others on satisfiable instances. The portfolio of techniques available in Golem is much stronger than any single one of them.
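The virtual best figure can be computed from per-engine results as follows. The sketch assumes the usual reading of a virtual best engine, namely that it solves a benchmark whenever at least one real engine does. The function `solved_counts`, the engine names `A`, `B`, `C`, and the benchmark names are placeholders, not data from the experiments.

```python
def solved_counts(results):
    """Count solved benchmarks per engine and for the virtual best (VB).

    results: dict mapping an engine name to the set of benchmarks it solved.
    """
    counts = {engine: len(solved) for engine, solved in results.items()}
    # VB solves a benchmark if at least one engine solves it.
    counts["VB"] = len(set().union(*results.values()))
    return counts


# Placeholder data: three engines with partially overlapping results.
example_results = {
    "A": {"b1", "b2"},
    "B": {"b2", "b3"},
    "C": {"b4"},
}
print(solved_counts(example_results))
# {'A': 2, 'B': 2, 'C': 1, 'VB': 4}
```

A gap between VB and the best single engine, as in this toy data, is what indicates that the engines are complementary.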

Moreover, the unified setting enables direct comparison of the algorithms. For example, we can conclude from these experiments that the extra check for k-inductive invariants on top of the BMC-style search for counterexamples, as implemented in the KIND engine, incurs only a small overhead on unsatisfiable problems, but makes the KIND engine very successful in solving satisfiable problems.

4.2 Category LIA-Lin

Next, we considered the LIA-Lin category of CHC-COMP. These are linear systems of CHCs with linear integer arithmetic as the background theory. There are many benchmarks in this category, and for the evaluation at the competition, a subset of benchmarks is selected (see [24, 31]). We evaluated the LAWI and Spacer engines of Golem (the engines capable of solving general linear CHC systems) on the benchmarks selected at CHC-COMP 2022 and compared their performance to Z3-Spacer and Eldarica. Notably, we also examined a specific subcategory of LIA-Lin, namely extra-small-lia, with benchmarks that fall into the fragment accepted by Golem's TPA engine.

There are 55 benchmarks in the extra-small-lia subcategory, all satisfiable, but known to be highly challenging for all tools. The results, given in Table 2, show that split-TPA outperforms not only the LAWI and Spacer engines in Golem, but also Z3-Spacer. Only Eldarica solves more benchmarks. We ascribe this to split-TPA's capability to perform deep analysis and discover transition invariants.

Table 2. Number of solved benchmarks from extra-small-lia subcategory.

For the whole LIA-Lin category, 499 benchmarks were selected in the 2022 edition of CHC-COMP [24]. The performance of the LAWI and Spacer engines of Golem, Z3-Spacer and Eldarica on this selection is summarized in Table 3. Here, the Spacer engine of Golem significantly outperforms the LAWI engine. Moreover, even though Golem loses to Z3-Spacer, it beats Eldarica. Given that Golem is a prototype, and Z3-Spacer and Eldarica have been developed and optimized for several years, this demonstrates the great potential of Golem.

Table 3. Number of solved benchmarks from LIA-Lin category.

4.3 Category LIA-Nonlin

Finally, we considered the LIA-Nonlin category of benchmarks of CHC-COMP, which consists of nonlinear systems of CHCs with linear integer arithmetic as the background theory. For the experiments, we used the 456 benchmarks selected for the 2022 edition of CHC-COMP. Spacer is the only engine in Golem capable of solving nonlinear CHC systems; thus, we focused on a more detailed comparison of its performance against Z3-Spacer and Eldarica. The results of the experiments are summarized in Fig. 3 and Table 4.

Fig. 3. Comparison on LIA-Nonlin category (separate markers for SAT and UNSAT). (Color figure online)

Table 4. Number of solved benchmarks from LIA-Nonlin category. The number of uniquely solved benchmarks is in parentheses.

Overall, Golem solved fewer problems than Z3-Spacer but more than Eldarica; however, all tools solved some instances uniquely. A detailed comparison is depicted in Fig. 3. For each benchmark, its data point in the plot reflects the runtime of Golem (x-axis) and the runtime of the competitor (y-axis). The plots suggest that the performance of Golem is often orthogonal to Eldarica, but highly correlated with the performance of Z3-Spacer. This is not surprising as the Spacer engine in Golem is built on the same core algorithm. Even though Golem is often slower than Z3-Spacer, there is a non-trivial number of benchmarks on which Z3-Spacer times out, but which Golem solves fairly quickly. Thus, Golem, while being a newcomer, already complements existing state-of-the-art tools, and more improvements are expected in the near future.

To summarise, the overall experimentation with different engines of Golem demonstrates the advantages of the multi-engine general framework and illustrates the competitiveness of its analysis. It provides a lot of flexibility in addressing various verification problems while being easily customizable with respect to the analysis demands.

5 Conclusion

In this work, we presented Golem, a flexible and effective Horn solver with multiple back-end engines, including recently introduced TPA-based model-checking algorithms. Golem is a suitable research tool for prototyping new SMT-based model-checking algorithms and comparing algorithms in a unified framework. Additionally, the effective implementation of the algorithms, achieved through tight coupling with the underlying SMT solver, makes it an efficient back end for domain-specific verification tools. Future directions for Golem include support for the VMT input format [21] and analysis of liveness properties, extension of TPA to nonlinear CHC systems, and support for the SMT theories of arrays, bit-vectors and algebraic datatypes.