1 Introduction

There are now many tools for verification of computer programs, but as far as we know, none of them claims to always produce correct results. The results of the Competition on Software Verification (SV-COMP) show that out of the 57 verifiers participating in the main category called Overall in the last five years (there were 10, 13, 11, 10, and 13 participants in this category in years 2018–2022, respectively), only four provided no incorrect results, namely Ultimate Kojak in 2018, CPA-Seq and Symbiotic in 2019, and Goblint in 2022. Moreover, communication with industrial developers reveals that even a relatively small portion of incorrect results can undermine the credibility of a verification tool. As a solution, many verifiers now accompany their verification results with evidence in the form of verification witnesses. These verification witnesses can be independently analyzed and potentially confirmed by witness validators. Industrial developers can use witness validation to triage the verification results: the results with unconfirmed witnesses are ignored and attention is focused on the confirmed ones.

Independent validation of verification witnesses is possible thanks to a machine-readable exchange format for witnesses. The first such format [11] was introduced in 2015. It supported only violation witnesses (also called counterexamples) produced when a verifier reports that a given program violates a considered safety specification. The authors of this format also extended the verification tools CPAchecker and Ultimate Automizer to support validation of these witnesses. In 2016, the format was extended to also accommodate witnesses for the cases when a verifier decides that a given program satisfies a given specification [9]. Such witnesses are called correctness witnesses, and they should contain some hints for the proof of program correctness. In the same year, the two mentioned tools were extended to support validation of correctness witnesses as well. In 2018, a new (execution-based) approach for checking violation witnesses was introduced and implemented in the tools CPA-witness2test and FShell-witness2test [12]. Another two witness validators called MetaVal [14] and NitWit [21] were introduced in 2020, followed by the validators Dartgnan [19] and Symbiotic-Witch [1] introduced in 2022. The evolution of the witness format and validators is driven by the SV-COMP community. Since SV-COMP 2021, the competition rewards with points only the verification results with witnesses confirmed by at least one witness validator (with the exception of several categories for which witness confirmation is not required for correctness witnesses due to unavailability of suitable witness validators).

The witness format [10, 11] is based on GraphML. Each witness contains information about the corresponding verification task (in particular, the program and the specification) and the verification result it witnesses. The main part of the witness resembles an automaton decorated with additional information. Hence, we talk about witness automata. A violation witness automaton represents a set of program paths and it is valid if at least one of these paths is feasible and violates the considered specification. Figure 1 provides an example of a C program that violates the specification that function reach_error is never called, and three different violation witnesses. In general, a violation witness automaton describes a set of program paths by specifying passed program locations (depicted by line numbers on edges), called functions, taken branches, constraints on variable values, etc. Each violation witness automaton has to contain at least one error state representing a specification violation (depicted in red). Further, it can also contain sink states (depicted in blue) saying that the represented paths violating the specification are elsewhere. A witness can represent a single program path by specifying all program inputs (as in Fig. 1b), it can say nothing about input values and prescribe taken branches (as in Fig. 1c), or it can combine some branching information with restrictions on input values (as in Fig. 1d).
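To make the notion of a violation witness automaton more concrete, the following Python sketch models a heavily simplified automaton in which edges are matched on line numbers only. The class `WitnessAutomaton`, its method names, the state names, and the rule that the automaton stays in its current state when no edge matches are assumptions of this illustration; the actual format is GraphML-based and carries much more information (called functions, branches, constraints on values).

```python
from dataclasses import dataclass, field


@dataclass
class WitnessAutomaton:
    """Toy violation witness automaton whose edges match line numbers only."""
    initial: str
    error_states: set
    sink_states: set = field(default_factory=set)
    # edges[state] is a list of (line_number, target_state) pairs
    edges: dict = field(default_factory=dict)

    def step(self, states, line):
        """Advance the tracked set of automaton states over one executed line."""
        successors = set()
        for state in states:
            if state in self.sink_states:
                continue  # paths that entered a sink are abandoned
            targets = [t for (l, t) in self.edges.get(state, []) if l == line]
            # Assumption of this sketch: without a matching edge, stay put.
            successors.update(targets if targets else [state])
        return successors

    def accepts(self, path):
        """True if the path (a list of line numbers) ends in an error state."""
        states = {self.initial}
        for line in path:
            states = self.step(states, line)
        return bool(states & self.error_states)


# Hypothetical automaton: line 3 leads towards the error, line 5 to a sink.
witness = WitnessAutomaton(
    initial="q0",
    error_states={"q_err"},
    sink_states={"q_sink"},
    edges={"q0": [(3, "q1"), (5, "q_sink")], "q1": [(9, "q_err")]},
)
```

In this toy model, `witness.accepts([1, 3, 7, 9])` holds, whereas a path passing line 5 ends in the sink and is not accepted. Acceptance by the automaton is only one half of validity: a validator additionally has to show that an accepted path is feasible and violates the specification.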

Fig. 1. Example C program with a bug (a) and violation witnesses for it: with test values (b), with branching information (c), and with intervals (d); taken from [10]

A correctness witness automaton represents program invariants and it is valid if all these invariants hold and the corresponding program satisfies the considered specification. Ideally, a correctness witness contains a minimal set of invariants implying that the program satisfies the specification. Figure 2 shows a fixed version of the C program (see the rectangle), which can be proven correct, and the correctness witness shows invariants (depicted in green) that help to re-establish the proof of correctness.

The examples of witnesses are adopted from the literature [10], which provides their detailed description: in Sect. 4.2, Examples 7 and 8 explain the violation witnesses (pages 21–27), and in Sect. 4.3, Example 9 explains the correctness witness (pages 31–33). The witness format also admits trivial witnesses that provide no useful information. A trivial violation witness represents all program paths and a trivial correctness witness provides no invariant. Validation of a trivial witness is as hard as the original verification task.

Fig. 2. Corrected C program (a) and a correctness witness for it (b); the only difference to Fig. 1a is the corrected type in line 9 (highlighted); taken from [10]

Overview and Outline. A witness validator is given a witness and the corresponding verification task, and it aims at confirming the verification result by proving that the witness is valid. On one side, the addition of the witness-validation step to the verification process increases the reliability of the confirmed verification results. On the other side, the reliability of witness validators is not challenged or even properly studied. As validators are often implemented using the same techniques as their corresponding verifiers (and by the same development teams), it is reasonable to expect that they also sometimes produce incorrect results.

In Sect. 2, we focus on the first goal of this paper, namely to evaluate the performance and reliability of current witness validators for C programs. There are currently 8 such validators, which can be divided into several categories according to their approach.

  • CPAchecker [11], MetaVal [14], and Ultimate Automizer [11] create a product of a witness automaton and the original program and analyze it. A violation witness is confirmed if the product exhibits the specification violation described by the witness. A correctness witness is confirmed if the product satisfies the specification and the invariants in the witness are valid (cf. [16], Sect. 4.3).

  • CPA-witness2test [12], CProver-witness2test (originally called FShell-witness2test) [12], and NitWit [21] can handle only violation witnesses. They derive a single test from a given witness automaton and execute it. The witness is confirmed if the execution violates the considered specification.

  • Symbiotic-Witch [1] can also process only violation witnesses. It performs symbolic execution of the given program and tracks the corresponding set of states in the witness automaton. A witness is confirmed if the symbolic execution violates the considered specification and the tracked set contains an error state of the witness automaton.

  • Dartgnan [19] is a bounded model checker for parallel programs, which has been extended with the ability to analyze violation witnesses. It transforms the violation witness and the program into an SMT query, and it confirms the witness if the query is satisfiable.

We evaluate the validators on witnesses produced in SV-COMP 2022. As various validators support different specifications and program features, they are applicable only to witnesses created for verification tasks of selected SV-COMP categories. Verification tasks with C programs are currently divided into 6 main categories, which can be roughly characterized as follows.

  • ReachSafety contains sequential programs that should be checked for unreachability of a given error function.

  • MemSafety consists of sequential programs that should be checked to contain no invalid dereference, no invalid deallocation, and no memory leaks.

  • ConcurrencySafety contains parallel programs that should be checked for unreachability of a given error function.

  • NoOverflows collects sequential programs that should contain no overflow of a signed integer.

  • Termination consists of sequential programs that should be checked to have no infinite execution.

  • SoftwareSystems collects more complex programs that are usually a part of real software projects and they should be checked for specifications described in ReachSafety, MemSafety, or NoOverflows.

Table 1. Applicability of validators to violation and correctness witnesses from individual SV-COMP categories; some validator names are abbreviated

The applicability of the considered validators to violation and correctness witnesses of individual SV-COMP categories is summarized in Table 1. Please note that even if the table indicates that a certain validator is applicable to violation or correctness witnesses of a certain category, it does not mean that the validator can handle all such witnesses of this category (for example, a validator may not support a specific feature of some programs).

Verification tasks in SV-COMP are labelled with expected verification results. We consider the labelling with expected results as highly reliable due to the following penalty mechanism of SV-COMP and the competitiveness of its community. In SV-COMP, if a verifier produces an incorrect result (i.e., the opposite of the expected one), it immediately gets many penalty points. If the authors of the verifier are confident that the result is correct, they can (and often do) challenge the expected result. The verification task is then discussed and potentially relabelled. Unfortunately, there is no set of witnesses labelled as valid or invalid, and we cannot safely assume that all witnesses accompanying correct verification results are valid. In fact, there are known cases of correct verification results accompanied by invalid witnesses. For example, this is the case of some violation witnesses produced by Symbiotic 9 for some MemSafety benchmarks [17]. However, when a verifier produces an incorrect verification result, the corresponding witness has to be invalid. In our experiments, we apply the existing witness validators to all relevant witnesses of both correct and incorrect verification results computed in SV-COMP 2022.

Section 3 is devoted to the second goal of this paper: to initiate qualitative improvement of witness validators. In particular, we suggest extending the semantics of possible validator outcomes and we propose a formula for evaluating validators. Our suggestions have been recently accepted by the SV-COMP community and a new competition track for witness validators has been announced starting from SV-COMP 2023.

Related Work. Existing papers on witness validators typically present only the confirmation rates of considered validators on the set of witnesses accompanying correct verification results, which are implicitly assumed to be valid witnesses [10, 14, 19, 21]. Evaluation of validators on invalid witnesses accompanying incorrect verification results has been previously done only twice: in 2015 for a limited set of invalid violation witnesses and the initial versions of witness validators CPAchecker and Ultimate Automizer [11], and in 2018 for a larger set of invalid violation witnesses and initial versions of witness validators CPA-witness2test and CProver-witness2test and then-current versions of CPAchecker and Ultimate Automizer [12]. In contrast, we consider invalid verification witnesses for both violation and correctness results and all 8 currently available witness validators in their versions used in SV-COMP 2022.

More information about witnesses and their validation in the context of SV-COMP can be found in regular competition reports [5, 6]. There is also a study [4] on violation and correctness witnesses produced in SV-COMP 2019.

2 Evaluation

We would like to investigate the state of the art of witness validation. Therefore, we take a large set of 158848 known syntactically correct witnesses from SV-COMP 2022 and validate all those witnesses using all available witness validators for C programs and report the results.

Execution Environment. We executed all experiments on a cluster with 167 machines, each with a CPU of type Intel Xeon E3-1230 v5, 3.4 GHz, with 8 processing units (virtual cores), 33 GB RAM, operating system Ubuntu 20.04 (Linux 5.4.0-94-generic). Each validation run (execution of one validator on one verification task and witness) was limited to 2 processing units, 7 GB memory, and 900 s of CPU time for correctness validators and 90 s of CPU time for violation validators. We chose this configuration because it was used in SV-COMP 2022. In order to ensure reliable measurement and control of the computing resources and isolation of processes, we used the benchmarking framework BenchExec [13].

Evaluated Validators. In this evaluation, we consider all eight witness validators for C programs that participated in SV-COMP 2022. Table 1 lists the validators and the categories for which they can validate witnesses.

Data Set and Benchmark. The witnesses and the verification tasks (program and specification) are taken from the data set of SV-COMP 2022 at Zenodo [8]. SV-COMP organizes the verification tasks with C programs into six categories. We take all witnesses produced for these tasks by all participating verification tools. Then we remove the witnesses for which WitnessLint produced an exception. Exceptions are typically caused by syntax problems or too large witness files.

We classify each violation witness for a correct program as invalid (because the competition classified the result of the verifier as false alarm), and we classify each correctness witness for a buggy program as invalid (because the competition classified the result of the verifier as wrong claim of correctness). All other witnesses are classified as \(\text {valid}^{*}\), because they do not contradict the expected result. We use the term \(\text {valid}^{*}\) with asterisk because there are witnesses that do not contradict the expected result but are still invalid (e.g., there can be a violation witness representing no feasible path violating the considered specification, even if such a path exists). However, there is currently no reliable way to automatically identify invalid witnesses that do not contradict the expected result. Tables 2 and 3 report in column ‘Witnesses’ the number of \(\text {valid}^{*}\) and invalid witnesses for each category.
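The classification rule described above can be summarized in a few lines of Python. This is only a sketch of the rule as stated in the text; the function name `classify_witness` and the string encodings of witness types and labels are chosen for illustration.

```python
def classify_witness(witness_type: str, program_expected_correct: bool) -> str:
    """Label a witness using only the expected result of its verification task.

    witness_type: "violation" or "correctness" (hypothetical encoding).
    program_expected_correct: True if the task's expected verification result
    says that the program satisfies the specification.
    """
    if witness_type == "violation":
        # A violation witness for a correct program stems from a false alarm.
        return "invalid" if program_expected_correct else "valid*"
    if witness_type == "correctness":
        # A correctness witness for a buggy program stems from a wrong proof.
        return "valid*" if program_expected_correct else "invalid"
    raise ValueError(f"unknown witness type: {witness_type!r}")
```

Note that the label `valid*` only states that the witness does not contradict the expected result; the function has no means to detect invalid witnesses among them.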

Table 2. Validation of violation witnesses by eight violation validators; the numbers are hyperlinked to the tables generated by BenchExec
Table 3. Validation of correctness witnesses by three correctness validators; the numbers are hyperlinked to the tables generated by BenchExec

Results. We report the results of our validation experiments in two tables. The results on violation witnesses are presented in Table 2 and the results on correctness witnesses in Table 3. For each category and validator, row ‘\(\text {valid}^{*}\) ’ reports the number of \(\text {valid}^{*}\) witnesses confirmed by the validator, and row ‘invalid’ reports the number of invalid witnesses erroneously confirmed by the validator. Due to the source of invalid witnesses described above, each erroneous confirmation of an invalid witness here means that the validator either confirmed a violation witness, but the program does not violate the specification, or it confirmed a correctness witness, but the program does violate the specification. In the following we highlight a few observations revealed by the results.

Soundness of validators. There is only one validator, namely Dartgnan, that does not confirm any invalid violation witness. The validator participated only in category ConcurrencySafety as it is specialized in parallel programs (Table 2). CPAchecker does not confirm any invalid correctness witness (Table 3).

There seems to be a particularly difficult category. The category SoftwareSystems has a large number of invalid violation witnesses (Table 2, ‘Witnesses’ column). This means that in this category, many verification runs report a false alarm for a correct program, accompanied by an invalid violation witness. The violation witnesses in this category seem to be difficult for validation, as only CPAchecker confirmed more than 10 % of \(\text {valid}^{*}\) violation witnesses. Moreover, all validators that confirmed at least ten \(\text {valid}^{*}\) violation witnesses confirmed also some invalid violation witnesses.

Our evaluation revealed technical problems. The validator MetaVal does not confirm any violation witness (Table 2) in categories ReachSafety and SoftwareSystems and confirms a large number of invalid correctness witnesses (Table 3) in these categories. The reason for those incorrect validation results is that the validator was not adapted to a new rule of SV-COMP that was introduced for SV-COMP 2021: All verification tasks in those categories were changed to using a new logic to encode invalid function calls. Other specifications are not affected by this change.

Summary. Most of the invalid witnesses that were incorrectly confirmed were due to bugs in validators. The conclusion is that the quality of validators should be increased by establishing means to stimulate the inspection and quality control of validation tools. A competition track for validators, suggested in the following section, could help draw the attention of developers to inspecting the results of validators. Currently, SV-COMP uses validators for confirmation of verification results, but does not evaluate the quality of their results.

Threats to Validity. Regarding internal validity, the main threat to our results is that we rely on the expected results for verification tasks. If those were incorrectly specified, our classification of validator results would also be incorrect. But the verification tasks in the benchmark collection that we use are actively maintained by the community and the participating teams inspected the results of their verifiers. The 33 actively participating teams in SV-COMP 2022 have approved the results of their verifiers before the results were published.

For executing the experiments, we used the publicly available benchmarking framework BenchExec [13], which gives us access to the modern features of the Linux kernel for controlling the resources and for isolating executions. This framework is used by several competitions and is actively maintained. For job distribution on the cluster we use VerifierCloud, which is also used by several competitions and research groups for their lab work. It is unlikely that a bug in the benchmarking infrastructure causes wrong results.

Regarding external validity, our results are specific to witness validators for the programming language C, because this is the only language for which a large set of verification and validation tools exists. The first two validators [18, 20] for Java were introduced for SV-COMP 2022. Further, our results are specific to validators that participated in SV-COMP and to the verification tasks from the SV-Benchmarks collection. We are not aware of any validators besides those participating in the competition, and we are not aware of a benchmark that is better suited for the evaluation than what is used by the competition. Therefore, we assume that our results are still significant because SV-COMP is comprehensive.

3 Suggestions for Advances in Witness Validation

Extended Semantics of Validator Outcomes. Possible validator answers recognized by SV-COMP are the same as possible answers of verifiers, which are

  • false, meaning that the given program violates the given specification and a violation witness was generated,

  • true, meaning that the given program satisfies the given specification and a correctness witness was generated, and

  • unknown, meaning that the verifier was unable to decide.

The interpretation of a witness-validator answer depends on the kind of the analyzed witness. A violation witness is confirmed if a validator outputs false. All other answers (including true and unknown) mean that the witness is not confirmed by this validator. Similarly, a correctness witness is confirmed if a validator outputs true and all other answers mean that the validator did not confirm the witness. In other words, even if a validator has the confidence to say that some witness is invalid, the competition rules give it the same semantics as unknown. As a consequence, there is no difference between witnesses that are not confirmed due to insufficient power of validators and those that were refuted by some validators.

We suggest explicitly stating the semantics of a validator output as follows. On violation witnesses, a validator should produce

  • false to confirm that there exists a program execution represented by the witness such that it violates the considered specification,

  • true to refute the witness as there is no program execution represented by the witness that violates the considered specification, or

  • unknown to indicate that it is unable to decide.

On correctness witnesses, a validator should produce

  • false to refute the witness as there exists some execution violating the considered specification or some invariant given in the witness,

  • true to confirm the witness as the validator can prove that the program satisfies the considered specification with help of the invariants given in the witness and that all invariants given in the witness are valid, or

  • unknown to indicate that it is unable to decide.
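The difference between the established interpretation and the suggested semantics can be illustrated with the following Python sketch. The function names and the outcome strings `confirmed`, `refuted`, `unconfirmed`, and `unknown` are assumptions of this illustration and are not prescribed by SV-COMP.

```python
ANSWERS = ("true", "false", "unknown")


def _check(witness_type, answer):
    if witness_type not in ("violation", "correctness"):
        raise ValueError(f"unknown witness type: {witness_type!r}")
    if answer not in ANSWERS:
        raise ValueError(f"unknown validator answer: {answer!r}")


def interpret_established(witness_type, answer):
    """Established reading: every non-confirming answer is 'unconfirmed'."""
    _check(witness_type, answer)
    confirming = "false" if witness_type == "violation" else "true"
    return "confirmed" if answer == confirming else "unconfirmed"


def interpret_suggested(witness_type, answer):
    """Suggested reading: the opposite answer explicitly refutes the witness."""
    _check(witness_type, answer)
    if answer == "unknown":
        return "unknown"
    confirming = "false" if witness_type == "violation" else "true"
    return "confirmed" if answer == confirming else "refuted"
```

For example, the answer `true` on a violation witness is merely `unconfirmed` under the established reading, but it is an explicit refutation under the suggested semantics.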

Evaluation of Validators. One can find many areas of computer science (e.g., SMT solving), where some kind of competition or regular evaluation led to a rapid improvement of the state of the art. With this motivation, we suggest extending SV-COMP with a comparative evaluation of witness validators, and we propose the following scoring schema for this evaluation.

Assume that we are given a witness validator, a set of \(\text {valid}^{*}\) witnesses, and a set of invalid witnesses. Our scoring schema is inspired by the established scoring schema for evaluating verifiers in SV-COMP. The community agreed that showing that a system satisfies a given specification deserves more credit than showing that the specification is violated. Hence, SV-COMP rewards correct (and confirmed) answers true with 2 points and correct (and confirmed) answers false with 1 point. The penalty factor for incorrect answers is \(-16\), which means that incorrect true yields \(-32\) points and incorrect false \(-16\) points.

The proposed scoring schema for validators is depicted in Fig. 3. We first describe the scores for invalid witnesses (the right side of the figure). Refutation of an invalid violation witness is rewarded with 2 points as it amounts to deciding that all program paths represented by the witness satisfy the specification, which is analogous to showing that a program satisfies its specification. Refutation of an invalid correctness witness is rewarded with 1 point as it corresponds to finding a violation of the specification or of some invariant given in the witness. Confirmation of an invalid witness yields the penalty p for a violation witness and 2p for a correctness witness, where p is the penalty factor (with \(p < 0\)). Points and penalties for invalid witnesses are accumulated in \(p_{\text {invalid}}\). The proposed scores for \(\text {valid}^{*}\) witnesses (the left side of the figure) reflect the fact that these witnesses are only assumed to be valid and some of them can actually be invalid. Hence, we suggest rewarding only confirmation of \(\text {valid}^{*}\) witnesses: 2 points for each confirmed correctness witness and 1 point for each confirmed violation witness. Points for \(\text {valid}^{*}\) witnesses are accumulated in \(p_{{\text {valid}^{*}}}\).

Fig. 3. Proposed scoring schema for evaluation of validators (with \(p<0\))
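The per-witness points of the scoring schema can be written down as a small Python function. This is a sketch under stated assumptions: the function name `witness_points` and the string encodings are hypothetical, the default `p=-16` merely mirrors the penalty factor used for verifiers, and awarding 0 points for the answer unknown on an invalid witness is an assumption of this illustration.

```python
def witness_points(witness_type, label, outcome, p=-16):
    """Points for one validation run.

    witness_type: "violation" or "correctness"
    label:        "valid*" or "invalid"
    outcome:      "confirmed", "refuted", or "unknown"
    p:            penalty factor, must be negative
    """
    if witness_type not in ("violation", "correctness"):
        raise ValueError(f"unknown witness type: {witness_type!r}")
    if outcome not in ("confirmed", "refuted", "unknown"):
        raise ValueError(f"unknown outcome: {outcome!r}")
    if p >= 0:
        raise ValueError("the penalty factor p must be negative")

    if label == "valid*":
        # Only confirmations are rewarded, as the witness may still be invalid.
        if outcome != "confirmed":
            return 0
        return 2 if witness_type == "correctness" else 1
    if label == "invalid":
        if outcome == "refuted":
            return 2 if witness_type == "violation" else 1
        if outcome == "confirmed":
            return p if witness_type == "violation" else 2 * p
        return 0  # assumption: no points for "unknown"
    raise ValueError(f"unknown label: {label!r}")
```

Summing the returned values over all \(\text {valid}^{*}\) witnesses gives \(p_{{\text {valid}^{*}}}\), and summing them over all invalid witnesses gives \(p_{\text {invalid}}\).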

One can observe in Tables 2 and 3 that the number of invalid witnesses is typically one or two orders of magnitude lower than the number of \(\text {valid}^{*}\) witnesses, and this imbalance is expected to increase if verifiers produce fewer incorrect verification results. Further, \(p_{\text {invalid}}\) deserves a higher impact than \(p_{{\text {valid}^{*}}}\) as we do not really know whether \(\text {valid}^{*}\) witnesses are indeed valid. Hence, we propose to compute the score as the sum

$$ score = \frac{p_{{\text {valid}^{*}}}}{|{\text {valid}^{*}}|}+q\cdot \frac{p_{\text {invalid}}}{|\text {invalid}|}$$

where the points in \(p_{{\text {valid}^{*}}}\) and \(p_{\text {invalid}}\) are normalized by the cardinality of the corresponding witness sets and \(p_{\text {invalid}}\) is given a higher weight using the factor q.

We suggest computing the validator scores separately for witnesses of each category. The overall score of a validator can be computed by the normalization used in SV-COMP to compute the overall scores of verifiers (see [3], page 597).

Our proposal of a comparative evaluation of witness validators based on the scoring schema above was presented and discussed at the SV-COMP community meeting on April 7, 2022. The community decided to establish a witness-validation track from SV-COMP 2023 onwards. The community further decided to use the suggested scoring schema and set the parameters to \(p=-16\) and \(q=2\).
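As a worked illustration of the score formula with the adopted parameters \(p=-16\) and \(q=2\), consider the following Python sketch. The function name `validator_score` and all witness counts in the example are hypothetical and do not come from our experiments; raising an error for an empty witness set is also an assumption, as the formula is undefined in that case.

```python
def validator_score(p_valid, n_valid, p_invalid, n_invalid, q=2):
    """score = p_valid / |valid*| + q * p_invalid / |invalid|"""
    if n_valid <= 0 or n_invalid <= 0:
        raise ValueError("both witness sets must be non-empty")
    return p_valid / n_valid + q * p_invalid / n_invalid


# Hypothetical category with violation witnesses only:
# 1000 valid* witnesses, 600 of them confirmed (1 point each).
p_valid = 600 * 1
# 50 invalid witnesses: 30 refuted (2 points each), 2 confirmed (p = -16 each).
p_invalid = 30 * 2 + 2 * (-16)

score = validator_score(p_valid, 1000, p_invalid, 50)
print(f"{score:.2f}")  # prints 1.72
```

In this example, the two erroneous confirmations cost the validator \(2 \cdot 32/50 = 1.28\) score points, more than twice what it gains from confirming 600 \(\text {valid}^{*}\) witnesses, which shows how strongly the normalization and the factor q weigh mistakes on invalid witnesses.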

4 Conclusion

Verification tools are complicated software systems, which naturally contain conceptual and programming mistakes. Therefore, it is imperative to apply validators to ensure that a verification engineer is not bothered with incorrect verification results. Our case study investigates the correctness of witness validators, in particular, how many invalid witnesses are confirmed by validators. The results indicate that there is room for improvement of the validators. We initiated the extension of SV-COMP by a comparative evaluation of witness validators that will utilize the full set of validator answers and use the presented scoring schema for ranking validators. If there is an incentive, then there will be improvement, as is shown by the enormous success of competitions in the field of formal methods [2].