Software Verification: 10th Comparative Evaluation (SV-COMP 2021)

SV-COMP 2021 is the 10th edition of the Competition on Software Verification (SV-COMP), an annual comparative evaluation of fully automatic software verifiers for C and Java programs. The competition provides a snapshot of the current state of the art in the area and has a strong focus on the reproducibility of its results. The competition was based on 15 201 verification tasks for C programs and 473 verification tasks for Java programs. Each verification task consisted of a program and a property (reachability, memory safety, overflows, termination). SV-COMP 2021 had 30 participating verification systems from 27 teams in 11 countries.


Introduction
Among several other objectives, the Competition on Software Verification (SV-COMP, https://sv-comp.sosy-lab.org/2021) showcases the state of the art in the area of automatic software verification. This is already the 10th edition of the competition, and it again presents an overview of the results currently achieved by tool implementations that are based on the most recent ideas, concepts, and algorithms for fully automatic verification. This competition report describes the (updated) rules and definitions, presents the competition results, and discusses some interesting facts about the execution of the competition experiments. The objectives of the competition were discussed earlier (objectives 1-4 [16]) and extended over the years (objectives 5-6 [17]):
1. provide an overview of the state of the art in software-verification technology and increase visibility of the most recent software verifiers,
2. establish a repository of software-verification tasks that is publicly available for free use as a standard benchmark suite for evaluating verification software,
3. establish standards that make it possible to compare different verification tools, including a property language and formats for the results,
4. accelerate the transfer of new verification technology to industrial practice by identifying the strengths of the various verifiers on a diverse set of tasks,
5. educate PhD students and others on performing reproducible benchmarking, packaging tools, and running robust and accurate research experiments, and
6. provide research teams that do not have sufficient computing resources with the opportunity to obtain experimental results on large benchmark sets.
The previous report [17] discusses the outcome of the SV-COMP competition so far with respect to these objectives.

Related Competitions.
Competitions are an important evaluation method and there are many competitions in the field of formal methods. We refer to the previous report [17] for a more detailed discussion and give here only the references to the most related competitions [9,19,55,56].
Quick Summary of Changes. We strive to continuously improve the competition, and this report describes the changes of the past year. In brief, SV-COMP 2021 introduced the following new items:
• SPDX identification of licenses in the SV-Benchmarks collection
• WitnessLint: a new checker for the syntactic validity of verification witnesses
• Upgrade of the task-definition format to version 2.0
• Addition of several verification tasks and whole new sub-categories to the SV-Benchmarks collection
• Elimination of the competition-specific functions __VERIFIER_error and __VERIFIER_assume from the verification tasks (and rules)
• Change in the scoring schema: unconfirmed results are no longer counted (when validation was applied)
• CoVeriTeam: a new tool that can be used to execute verification runs remotely on the competition machines
• Automatic participation of previous verifiers

Organization, Definitions, Formats, and Rules

Procedure. The overall organization of the competition did not change in comparison to the earlier editions [10,11,12,13,14,15,16,17]. SV-COMP is an open competition (also known as a comparative evaluation): all verification tasks are known before the submission of the participating verifiers, which is necessary due to the complexity of the C language. The procedure is partitioned into the benchmark-submission phase, the training phase, and the evaluation phase. The participants received the results of their verifier continuously via e-mail (for pre-runs and the final competition run), and the results were publicly announced on the competition web site after the teams had inspected them. The competition jury oversees the process and consists of the competition chair and one member of each participating team. The team representatives on the jury are listed in Table 5.

License Requirements. Starting in 2018, SV-COMP has required that a verifier be publicly available for download under a license that (i) allows reproduction and evaluation by anybody (incl. publication of the results), (ii) does not restrict the use of the verifier output (log files, witnesses), and (iii) allows any kind of (re-)distribution of the unmodified verifier archive.
During the qualification phase, when the jury members inspected the verifier archives, several license issues (missing licenses, incompatibilities) were detected; the developers were able to address them in time.
With SV-COMP 2021, the community started the process of making the benchmark collection REUSE compliant (https://reuse.software) by adding SPDX license identifiers (https://spdx.dev). A few directories are properly labeled already, and continuous-integration checks with REUSE ensure that new contributions adhere to the standard.
Validation of Results. This time, the validation of the verification results was done by seven validation tools, which are listed in Table 1, including references to the literature. The validators CPAchecker and UAutomizer have supported the competition since the beginning of its result validation in 2015. Execution-based validation was added in 2018, using CPA-w2t and FShell-w2t. Two new validators have participated since the previous SV-COMP in 2020: Nitwit and MetaVal. A few categories were still excluded from validation because no validators were available for some types of programs or properties.
For SV-COMP 2021, the new validator WitnessLint was added to check witnesses for syntactic validity. It checks the witnesses produced by the verification tools against the specification of the format for verification witnesses (https://github.com/sosy-lab/sv-witnesses/tree/svcomp21). For example, WitnessLint ensures that a verification witness is a proper XML/GraphML file and contains the required metadata. This means that the result validators can focus on the validation of the verification result, assuming that the verification witness is syntactically valid. If the witness linter deems a verification witness syntactically invalid, then the answers of the result validators are ignored and the result is not counted as confirmed.
Task-Definition Format 2.0. The format for the task definitions in the SV-Benchmarks repository was recently extended to include a set of options that can carry information from the verification task to the verification tool. SV-COMP 2021 used the task-definition format in version 2.0 (https://gitlab.com/sosy-lab/benchmarking/task-definition-format/-/tree/2.0). More details can be found in the report for Test-Comp 2021 [19].
Properties. Please see the 2015 competition report [13] for the definition of the properties and the property format. All specifications are available in the directory c/properties/ of the benchmark repository.
Categories. The updated category structure is illustrated by Fig. 1. The categories are also listed in Tables 7 and 8, and described in detail on the competition web site (https://sv-comp.sosy-lab.org/2021/benchmarks.php). Compared to the category structure for SV-COMP 2020, we added the sub-categories XCSP and Combinations to category ReachSafety, and the sub-categories DeviceDriversLinux64Large-ReachSafety, uthash-MemSafety, uthash-NoOverflows, and uthash-ReachSafety to category SoftwareSystems.
Another effort was to integrate some of the Juliet benchmark tasks [31] into the SV-Benchmarks collection. We requested a license for the Juliet programs that properly clarifies the license terms also outside the USA. We thank our colleagues from NIST for releasing their Juliet benchmark (which is declared as public domain) under the Creative Commons license CC0-1.0 (https://github.com/sosy-lab/sv-benchmarks/blob/svcomp21/LICENSES/CC0-1.0.txt). SV-COMP 2021 used many verification tasks from Juliet, in particular for the memory-safety properties CWE121 (stack-based buffer overflow), CWE401 (memory leak), CWE415 (double free), CWE476 (null-pointer dereference), and CWE590 (free memory that is not on the heap) (see https://github.com/sosy-lab/sv-benchmarks/blob/svcomp21/c/MemSafety-Juliet.set).
All these new contributions to the benchmark collection led to a growth in the number of verification tasks from 11 052 in SV-COMP 2020 to 15 201 in SV-COMP 2021.
Verification Tasks. The previous verification tasks and competition rules used special definitions for the functions __VERIFIER_error and __VERIFIER_assume. These special definitions were found to be unintuitive and inconsistent with expectations in the verification community, and repeatedly caused confusion among participants. A call of function __VERIFIER_error() was defined to never return. A call of function __VERIFIER_assume(p) was defined such that if expression p evaluates to false, then the function loops forever, otherwise the function returns without any side effects. This led to unintended interactions with other properties.
We eliminated these two functions in two steps. In the first step, each function call was replaced by a C-code implementation of the intended behavior. In most of the cases, __VERIFIER_error(); was replaced by the C code reach_error(); abort();, where reach_error is a 'normal' function, i.e., one whose interpretation follows the C standard [3].
Eliminating __VERIFIER_assume was more complicated: In some tasks for property memory-cleanup, __VERIFIER_assume(p); was replaced by the C code assume_cycle_if_not(p);, which is implemented as if (!p) while(1);, while for other tasks, __VERIFIER_assume(p); was replaced by assume_abort_if_not(p);, which is implemented as if (!p) abort();. The solution nicely illustrates the problem of the special semantics: Consider property memory-cleanup, which requires that all allocated memory be deallocated before the program terminates. Here, the desired behavior of a failing assume statement is that the program does not terminate (and thus does not unintendedly violate the memory-cleanup property). Now consider property termination, which requires that every path finally reaches the end of the program. Here, the desired behavior of a failing assume statement is that the program terminates (and thus does not unintendedly violate the termination property).
In the second step, the specifications for functions __VERIFIER_error and __VERIFIER_assume were removed from the competition rules (because no such functions exist anymore in the SV-Benchmarks collection).
Scoring Schema and Ranking. Table 2 provides an overview and Fig. 2 visually illustrates the score assignment for the reachability property as an example.
The scoring schema was changed regarding the special rule for unconfirmed correct results for expected result True. During a transitioning phase, there was a rule to assign one point if the answer matched the expected result but the witness was not confirmed. Now, score points are assigned only if the result was confirmed by a validator (or no validator was available).
As in previous years, the rank of a verifier was decided based on the sum of points (normalized for meta categories). In case of a tie, the rank was decided based on success run time, which is the total CPU time over all verification tasks for which the verifier reported a correct verification result. Opt-outs from categories and score normalization for meta categories were done as described previously [11] (page 597).

Reproducibility
To allow independent reproduction of the SV-COMP results, we made all major components that were used in the competition available in public version-control repositories. An overview of the components that contribute to the reproducible setup of SV-COMP is provided in Fig. 3, and the details are given in Table 3. We refer to the SV-COMP 2016 report [14] for a description of all components of the SV-COMP organization.
We have published the competition artifacts at Zenodo (see Table 4) to guarantee their long-term availability and immutability. These artifacts comprise the verification tasks, the competition results, the produced verification witnesses, and the BenchExec package. The archive for the competition results includes the raw results in BenchExec's XML exchange format, the log output of the verifiers and validators, and a mapping from file names to SHA-256 hashes. The hashes of the files are useful for validating the exact contents of a file, and accessing the files inside the archive that contains the verification witnesses.
Competition Workflow. The workflow of the competition is described in the report for Test-Comp 2021 [19].
CoVeriTeam. The competition was supported for the first time by CoVeriTeam [26] (https://gitlab.com/sosy-lab/software/coveriteam/), a tool for cooperative verification. Among its many capabilities, it enables the remote execution of verification runs directly on the competition machines, which was found to be a valuable service for troubleshooting.

Results and Discussion
The results of the competition experiments represent the state of the art in fully automatic software-verification tools. The report shows the results in terms of effectiveness (number of verification tasks that can be solved and correctness of the results, as accumulated in the score) and efficiency (resource consumption in terms of CPU time and CPU energy). The results are presented in the same way as in previous years, so that improvements compared to last year are easy to identify. The results presented in this report were inspected and approved by the participating teams. We now discuss the highlights of the results.
Participating Verifiers. Table 5 provides an overview of the participating verification systems (see also the listing on the competition web site at https://sv-comp.sosy-lab.org/2021/systems.php). Table 6 lists the algorithms and techniques that are used by the verification tools.
Automatic Participation. To ensure that the comparative evaluation continues to give an overview of the state of the art that is as broad as possible, a rule was introduced before SV-COMP 2020 that gives the organizer the option to reuse systems that participated in previous years. This option was used three times in SV-COMP 2021: for Coastal, PredatorHP, and SPF. Those participations are marked as 'hors concours' in Table 5.

The runs were executed and measured with a benchmarking framework [30] (integrated in BenchExec [28]). One complete verification execution of the competition consisted of 163 177 verification runs (each verifier on each verification task of the selected categories, according to the opt-outs), consuming 470 days of CPU time and 126 kWh of CPU energy (without validation). Witness-based result validation required 961 919 validation runs (each validator on each verification task for categories with witness validation, and for each verifier), consuming 274 days of CPU time. Each tool was executed several times, in order to make sure that no installation issues occur during the execution. Including pre-runs, the infrastructure managed a total of 1.33 million verification runs consuming 4.16 years of CPU time, and 7.31 million validation runs consuming 3.84 years of CPU time.
Quantitative Results. Table 7 presents the quantitative overview of all tools and all categories. The head row mentions the category, the maximal score for the category, and the number of verification tasks. The tools are listed in alphabetical order; every table row lists the scores of one verifier. We indicate the top three candidates by formatting their scores in bold face and in a larger font size. An empty table cell means that the verifier opted out from the respective main category (perhaps participating in sub-categories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site (https://sv-comp.sosy-lab.org/2021/results) and in the results artifact (see Table 4). Table 8 reports the top three verifiers for each category. The run time (column 'CPU Time') and energy (column 'CPU Energy') refer to successfully solved verification tasks (column 'Solved Tasks'). We also report the number of tasks for which no witness validator was able to confirm the result (column 'Unconf. Tasks'). The columns 'False Alarms' and 'Wrong Proofs' report the numbers of verification tasks for which the verifier reported wrong results, i.e., reporting a counterexample although the property holds (incorrect False) and claiming that the program fulfills the property although it actually contains a bug (incorrect True), respectively.

Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [11,28] because these visualizations make it easier to understand the results of the comparative evaluation. The web site (https://sv-comp.sosy-lab.org/2021/results) and the results archive (see Table 4) include such a plot for each (sub-)category. As an example, we show the plot for category C-Overall (all verification tasks) in Fig. 4.
A total of 10 verifiers participated in category C-Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [11]). A more detailed discussion of score-based quantile plots, including examples of what insights one can obtain from the plots, is provided in previous competition reports [11,14].
Alternative Rankings. The community suggested reporting a couple of alternative rankings, which honor different aspects of the verification process, as a complement to the official SV-COMP ranking. (In the quantile plots, a logarithmic scale is used for the time range from 1 s to 1000 s, and a linear scale for the time range between 0 s and 1 s; more details were given previously [11].) Table 9 is similar to Table 8, but its column 'Wrong Results' reports the sum of false alarms and wrong proofs (in number of errors), and its column 'Rank Measure' gives the measure used to determine the alternative rank.
Correct Verifiers - Low Failure Rate. The right-most columns of Table 8 report that the verifiers achieve a high degree of correctness (all top three verifiers in category C-Overall have less than 2 ‰ wrong results). The winners of category Java-Overall did not produce a single wrong answer. The first category in Table 9 uses as rank measure the number of errors per score point (E/sp). We use E as the unit for the number of incorrect results and sp as the unit for the total score. The worst result was 0.032 E/sp in SV-COMP 2020 and is now improved to 0.023 E/sp.
Green Verifiers - Low Energy Consumption. Since a large part of the cost of verification is given by the energy consumption, it might be important to also consider energy efficiency. The second category in Table 9 uses the energy consumption per score point as rank measure: total CPU energy divided by total score, with the unit J/sp. The worst result from SV-COMP 2020 was 2 200 J/sp, now improved to 630 J/sp.

New Verifiers. To acknowledge the verification systems that participate for the first or second time in SV-COMP, Table 10 lists the new verifiers (in SV-COMP 2020 or SV-COMP 2021).
Verifiable Witnesses. Result validation is of primary importance in the competition. All SV-COMP verifiers are required to justify their result (True or False) by producing a verification witness (except in those categories for which no witness validator is available). We used six independently developed witness-based result validators and one witness linter (see Table 1). Table 11 shows the confirmed versus unconfirmed results: the first column lists the verifiers of category C-Overall; the three columns for result True report the total, confirmed, and unconfirmed numbers of verification tasks for which the verifier answered True; and the three columns for result False report the same for answer False. More information (for all verifiers) is given in the detailed tables on the competition web site and in the results artifact; all verification witnesses are also contained in the witnesses artifact (see Table 4). The verifiers 2ls and UKojak are the winners in terms of confirmed results for expected results True and False, respectively. The overall interpretation is similar to SV-COMP 2020 [17].

Conclusion
The 10th edition of the Competition on Software Verification (SV-COMP 2021) had 30 participating verification systems from 11 countries (see Fig. 5 for the participation numbers and Table 5 for the details). The competition not only executes the verifiers and collects results, but also validates the verification results using verification witnesses. We used six independent validators to check the results and a witness linter to check whether the verification witnesses are syntactically valid (Table 1). The number of verification tasks was increased to 15 201 in the C category and to 473 in the Java category. The high quality standards of the TACAS conference, in particular with respect to the important principles of fairness, community support, and transparency, are ensured by a competition jury in which each participating team has a member. The results of our comparative evaluation provide a broad overview of the state of the art in automatic software verification. SV-COMP is instrumental in developing more reliable tools, as well as in identifying and propagating successful techniques for software verification.
Data Availability Statement. The verification tasks and results of the competition are published at Zenodo, as described in Table 4. All components and data that are necessary for reproducing the competition are available in public version-control repositories, as specified in Table 3. Furthermore, the results are presented online on the competition web site for easy access: https://sv-comp.sosy-lab.org/2021/results/.