Status Report on Software Testing: Test-Comp 2021

This report describes Test-Comp 2021, the 3rd edition of the Competition on Software Testing. The competition is a series of annual comparative evaluations of fully automatic software test generators for C programs. The competition has a strong focus on reproducibility of its results and its main goal is to provide an overview of the current state of the art in the area of automatic test-generation. The competition was based on 3 173 test-generation tasks for C programs. Each test-generation task consisted of a program and a test specification (error coverage, branch coverage). Test-Comp 2021 had 11 participating test generators from 6 countries.


Introduction
Among several other objectives, the Competition on Software Testing (Test-Comp [4,5,6], https://test-comp.sosy-lab.org/2021) showcases every year the state of the art in the area of automatic software testing. This edition of Test-Comp is the 3rd edition of the competition. It provides an overview of the currently achieved results by tool implementations that are based on the most recent ideas, concepts, and algorithms for fully automatic test generation. This competition report describes the (updated) rules and definitions, presents the competition results, and discusses some interesting facts about the execution of the competition experiments. The setup of Test-Comp is similar to SV-COMP [8], in terms of both technical and procedural organization. The results are collected via BenchExec's XML results format [16], and transformed into tables and plots in several formats (https://test-comp.sosy-lab.org/2021/results/). All results are available in artifacts at Zenodo (Table 3).
Competition Goals. In summary, the goals of Test-Comp are the following [5]:
• Establish standards for software test generation. This means, most prominently, to develop a standard for marking input values in programs, define an exchange format for test suites, agree on a specification language for test-coverage criteria, and define how to validate the resulting test suites.
• Establish a set of benchmarks for software testing in the community. This means to create and maintain a set of programs together with coverage criteria, and to make those publicly available for researchers to be used in performance comparisons when evaluating a new technique.
• Provide an overview of available tools for test-case generation and a snapshot of the state of the art in software testing to the community. This means to compare, independently of particular paper projects and specific techniques, different test generators in terms of effectiveness and performance.
• Increase the visibility and credits that tool developers receive. This means to provide a forum for presentation of tools and discussion of the latest technologies, and to give the participants the opportunity to publish about the development work that they have done.
• Educate PhD students and other participants on how to set up performance experiments, package tools in a way that supports reproduction, and perform robust and accurate research experiments.
• Provide resources to development teams that do not have sufficient computing resources and give them the opportunity to obtain results from experiments on large benchmark sets.
Related Competitions. In the field of formal methods, competitions are respected as an important evaluation method and there are many competitions [2]. We refer to the previous report [5] for a more detailed discussion and give here only the references to the most related competitions [2,8,32,39].
Quick Summary of Changes. As the competition continuously improves, we report the changes since the last report; Test-Comp 2021 introduced five new items, which are described in the respective sections below. Organizational aspects such as the classification (automatic, off-site, reproducible, jury, training) and the competition schedule are given in the initial competition definition [4]. In the following, we repeat some important definitions that are necessary to understand the results.

Execution of a Test Generator. Figure 1 illustrates the process of executing one test generator on the benchmark suite. One test run for a test generator gets as input (i) a program from the benchmark suite and (ii) a test specification (cover bug, or cover branches), and returns as output a test suite (i.e., a set of test cases). The test generator is contributed by a competition participant as a software archive in ZIP format. The test runs are executed centrally by the competition organizer. The test-suite validator takes as input the test suite from the test generator and validates it by executing the program on all test cases: for bug finding, it checks whether the bug is exposed, and for coverage, it reports the coverage. We use the tool TestCov [15] as test-suite validator.
Test Specification. The specification for testing a program is given to the test generator as input file (either properties/coverage-error-call.prp or properties/coverage-branches.prp for Test-Comp 2021).
The definition init(main()) is used to define the initial states of the program under test by a call of function main (with no parameters). The definition FQL(f) specifies that coverage definition f should be achieved. The FQL (FShell query language [28]) coverage definition COVER EDGES(@DECISIONEDGE) means that all branches should be covered (typically used to obtain a standard test suite for quality assurance), and COVER EDGES(@CALL(foo)) means that a call (at least one) to function foo should be covered (typically used for bug finding). A complete specification looks as follows: COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) ). Table 1 lists the two FQL formulas that are used in test specifications of Test-Comp 2021; there was no change from 2020 (except that the special function __VERIFIER_error does not exist anymore).
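Put together, the contents of the two specification files referenced above might look as follows (shown with the placeholder function foo from the text; the actual error-call specification in the SV-Benchmarks repository names the designated error function):

```text
properties/coverage-branches.prp:
  COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) )

properties/coverage-error-call.prp (with placeholder function foo):
  COVER( init(main()), FQL(COVER EDGES(@CALL(foo))) )
```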
Task-Definition Format 2.0. The format for the task definitions in the SV-Benchmarks repository was extended by options that can carry information from the test-generation task to the test tool. Test-Comp 2021 used the format in version 2.0 (https://gitlab.com/sosy-lab/benchmarking/task-definition-format/-/tree/2.0). The options now contain the language (C or Java) and the data model (ILP32, LP64, see http://www.unix.org/whitepapers/64bit.html, only for C programs) that the program of the test-generation task assumes (https://github.com/sosy-lab/sv-benchmarks#task-definitions). An example task definition is provided in Fig. 2: This YAML file specifies, for the C program floppy.i.cil-3.c, two verification tasks (reachability of a function call and memory safety) and one test-generation task (coverage of all branches). Previously, the options for language and data model were defined in category-specific configuration files (for example c/ReachSafety-ControlFlow.cfg), which were deleted before Test-Comp 2021.
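In the spirit of Fig. 2, a task-definition file in format version 2.0 might look as follows (a hypothetical sketch: the property paths, expected verdicts, and data model shown here are illustrative, not the actual values for floppy.i.cil-3.c):

```yaml
format_version: '2.0'

input_files: 'floppy.i.cil-3.c'

properties:
  # two verification tasks (illustrative verdicts)
  - property_file: ../properties/unreach-call.prp
    expected_verdict: false
  - property_file: ../properties/valid-memsafety.prp
    expected_verdict: true
  # one test-generation task: cover all branches
  - property_file: ../properties/coverage-branches.prp

options:
  language: C
  data_model: ILP32
```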
License and Qualification. The license of each participating test generator must allow its free use for reproduction of the competition results. Details on qualification criteria can be found in the competition report of Test-Comp 2019 [6]. Furthermore, the community tries to apply the SPDX standard (https://spdx.dev) to the SV-Benchmarks repository. Continuous-integration checks based on REUSE (https://reuse.software) will ensure that all benchmark tasks adhere to the standard.

Categories and Scoring Schema
Benchmark Programs. The input programs were taken from the largest and most diverse open-source repository of software-verification and test-generation tasks, which is also used by SV-COMP [8]. As in 2020, we selected all programs for which the following properties were satisfied (see the issue on GitHub and the report [6]):
1. the program compiles with gcc, if a harness for the special methods is provided,
2. the program contains at least one call to a nondeterministic function,
3. the program does not rely on nondeterministic pointers,
4. the program does not have expected result 'false' for property 'termination', and
5. the program has expected result 'false' for property 'unreach-call' (only for category Error Coverage).
This selection yielded a total of 3 173 test-generation tasks, namely 607 tasks for category Error Coverage and 2 566 tasks for category Code Coverage. The test-generation tasks are partitioned into categories, which are listed in Tables 6 and 7 and described in detail on the competition web site. Figure 3 illustrates the category composition. The programs in the benchmark collection contained functions __VERIFIER_error and __VERIFIER_assume that had a specific predefined meaning. Last year, those functions were removed from all programs in the SV-Benchmarks collection. More about the reasoning is explained in the SV-COMP 2021 competition report [8].
Category Error-Coverage. The first category is to show the abilities to discover bugs. The benchmark set consists of programs that contain a bug. Every run will be started by a batch script, which produces for every tool and every test-generation task one of the following scores: 1 point, if the validator succeeds in executing the program under test on a generated test case that exposes the bug (i.e., the specified function was called), and 0 points, otherwise.

Category Branch-Coverage. The second category is to cover as many branches of the program as possible. This coverage criterion was chosen because many test generators support it by default. Other coverage criteria can be reduced to branch coverage by transformation [27]. Every run will be started by a batch script, which produces for every tool and every test-generation task the coverage of branches of the program (as reported by TestCov [15]; a value between 0 and 1) that are executed by the generated test cases. The score is the returned coverage.
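The two scoring rules can be summarized in a small sketch (hypothetical helper names; the actual scores are derived by the competition infrastructure from TestCov's output):

```python
def score_error_coverage(bug_exposed: bool) -> int:
    """Error Coverage: 1 point if some generated test case
    exposes the bug, 0 points otherwise."""
    return 1 if bug_exposed else 0


def score_branch_coverage(coverage: float) -> float:
    """Branch Coverage: the score equals the branch coverage
    reported by the validator, a value between 0 and 1."""
    assert 0.0 <= coverage <= 1.0
    return coverage


# Example: one task whose test suite exposes the bug,
# and one task whose test suite covers 73 % of the branches
print(score_error_coverage(True))    # -> 1
print(score_branch_coverage(0.73))   # -> 0.73
```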
Ranking. The ranking was decided based on the sum of points (normalized for meta categories). In case of a tie, the ranking was decided based on the run time, which is the total CPU time over all test-generation tasks. Opt-out from categories was possible and scores for categories were normalized based on the number of tasks per category (see competition report of SV-COMP 2013 [3], page 597).
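One plausible reading of this normalization scheme can be sketched as follows (the authoritative definition is in the SV-COMP 2013 report [3]; the helper name and example numbers here are illustrative):

```python
def meta_score(category_results):
    """Normalized score of a meta category (one plausible reading of [3]):
    each category contributes its score divided by its number of tasks,
    and the sum is scaled by the average number of tasks per category,
    so that no single large category dominates the meta category."""
    tasks = [n for _, n in category_results]
    normalized_sum = sum(score / n for score, n in category_results)
    avg_tasks = sum(tasks) / len(tasks)
    return normalized_sum * avg_tasks


# Illustrative numbers: 400 of 607 points in one category,
# 1500 of 2566 points in another
print(round(meta_score([(400, 607), (1500, 2566)]), 1))
```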

Reproducibility
In order to support independent reproduction of the Test-Comp results, we made all major components that are used for the competition available in public version-control repositories. An overview of the components that contribute to the reproducible setup of Test-Comp is provided in Fig. 4, and the details are given in Table 2. We refer to the report of Test-Comp 2019 [6] for a thorough description of all components of the Test-Comp organization and how we ensure that all parts are publicly available for maximal reproducibility.
In order to guarantee long-term availability and immutability of the test-generation tasks, the produced competition results, and the produced test suites, we also packaged the material and published it at Zenodo (see Table 3). The archive for the competition results includes the raw results in BenchExec's XML exchange format, the log output of the test generators and validator, and a mapping from file names to SHA-256 hashes. The hashes of the files are useful for validating the exact contents of a file, and for accessing the files inside the archive that contains the test suites.
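Such a hash mapping can be checked with a few lines of Python (a sketch; it assumes the mapping has already been parsed into a dictionary from file name to expected SHA-256 digest):

```python
import hashlib


def sha256_of_file(path: str) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks
    so that large result archives do not need to fit into memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 16), b''):
            h.update(chunk)
    return h.hexdigest()


def verify(mapping: dict) -> list:
    """Return the names of all files whose current contents do not
    match the recorded digest (an empty list means all files are intact)."""
    return [name for name, digest in mapping.items()
            if sha256_of_file(name) != digest]
```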
To provide transparent access to the exact versions of the test generators that were used in the competition, all test-generator archives are stored in a public Git repository. GitLab was used to host the repository for the test-generator archives due to its generous repository size limit of 10 GB.
Competition Workflow. As illustrated in Fig. 4, the ingredients for a test or verification run are (a) a test or verification task (which program and which specification to use), (b) a benchmark definition (which categories and which options to use), (c) a tool-info module (uniform way to access a tool's version string and the command line to invoke), and (d) an archive that contains all executables that are required and cannot be installed as standard Ubuntu package.
(a) Each test or verification task is defined by a task-definition file (as shown, e.g., in Fig. 2). The tasks are stored in the SV-Benchmarks repository and maintained by the verification and testing community, including the competition participants and the competition organizer.
(b) A benchmark definition defines the choices of the participating team, that is, which categories to execute the test generator on and which parameters to pass to the test generator. The benchmark definition also specifies the resource limits of the competition runs (CPU time, memory, CPU cores). The benchmark definitions are created or maintained by the teams and the organizer.

(c) A tool-info module is a component that provides a uniform way to access the test-generation or verification tool: it provides interfaces for accessing the version string of a test generator and assembles the command line from the information given in the benchmark definition and task definition. The tool-info modules are written by the participating teams with the help of the BenchExec maintainer and others.
(d) A test generator is provided as an archive in ZIP format. The archive contains a directory with a README and LICENSE file as well as all components that are necessary for the test generator to be executed. This archive is created by the participating team and merged into the central repository via a merge request.
All components above are reviewed by the competition jury and improved by the teams and the organizer according to the comments from the reviewers. Due to the reproducibility requirements and the high level of automation that is necessary for a competition like Test-Comp, participating in the competition is a challenge in itself: package the tool, provide meaningful log output, specify the benchmark definition, implement a tool-info module, and troubleshoot in case of problems. Test-Comp is a friendly and helpful community, and problems are reported in a GitLab issue tracker, where the organizer and the other teams help to fix them.
To provide participants access to the actual competition machines, the competition used CoVeriTeam [13] (https://gitlab.com/sosy-lab/software/coveriteam/) for the first time. CoVeriTeam is a tool for cooperative verification, which enables remote execution of test-generation or verification runs directly on the competition machines (among its many other features). This possibility was found to be a valuable service for troubleshooting.

Results and Discussion
For the third time, the competition experiments represent the state of the art in fully automatic test generation for whole C programs. The report helps in understanding the improvements compared to last year, in terms of effectiveness (test coverage, as accumulated in the score) and efficiency (resource consumption in terms of CPU time). All results mentioned in this article were inspected and approved by the participants.
Participating Test Generators. Table 4 provides an overview of the participating test generators and references to publications, as well as the team representatives of the jury of Test-Comp 2021. (The competition jury consists of the chair and one member of each participating team.) Table 5 lists the features and technologies that are used in the test generators. An online table with information about all participating systems is provided on the competition web site.

The energy was measured with CPU Energy Meter [17] (integrated in BenchExec [16]). Further technical parameters of the competition machines are available in the repository that also contains the benchmark definitions. One complete test-generation execution of the competition consisted of 34 903 single test-generation runs. The total CPU time was 220 days and the consumed energy 56 kWh for one complete competition run for test generation (without validation). Test-suite validation consisted of 34 903 single test-suite validation runs. The total consumed CPU time was 6.3 days. Each tool was executed several times, in order to make sure that no installation issues occurred during the execution. Including preruns, the infrastructure processed a total of 210 632 test-generation runs (consuming 1.8 years of CPU time) and 207 459 test-suite validation runs (consuming 27 days of CPU time). We did not measure the CPU energy during preruns.
Quantitative Results. Table 6 presents the quantitative overview of all tools and all categories. The head row mentions the category and the number of test-generation tasks in that category. The tools are listed in alphabetical order; every table row lists the scores of one test generator. We indicate the top three candidates by formatting their scores in bold face and in larger font size. An empty table cell means that the test generator opted-out from the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site and in the results artifact (see Table 3). Table 7 reports the top three test generators for each category. The consumed run time (column 'CPU Time') is given in hours and the consumed energy (column 'Energy') is given in kWh.
Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [16] because these visualizations make it easier to understand the results of the comparative evaluation. The web site and the results artifact (Table 3) include such a plot for each category; as an example, we show the plot for category Overall (all test-generation tasks) in Fig. 5. All 11 test generators participated in category Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [3]). A more detailed discussion of score-based quantile plots for testing is provided in the previous competition report [6].
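The construction of the data points of such a quantile plot can be sketched as follows (a simplified sketch; the actual plots are generated by the competition infrastructure from the BenchExec results [16]):

```python
def quantile_points(runs):
    """Data points of a score-based quantile function (simplified):
    runs with a positive score are sorted by CPU time, and each run
    is plotted at x = score accumulated so far, y = its CPU time.
    Each run is a pair (cpu_time_in_seconds, score)."""
    points, accumulated = [], 0.0
    for cpu_time, score in sorted(r for r in runs if r[1] > 0):
        accumulated += score
        points.append((accumulated, cpu_time))
    return points


# Three runs with (CPU time in s, score); the fastest run comes first
print(quantile_points([(10.0, 1.0), (2.0, 0.5), (5.0, 1.0)]))
# -> [(0.5, 2.0), (1.5, 5.0), (2.5, 10.0)]
```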
Alternative Rankings. More details were given previously [6]. The graphs are decorated with symbols to make them better distinguishable without color.

Green Testing: Low Energy Consumption. Since a large part of the cost of test generation is caused by the energy consumption, it might be important to also consider energy efficiency in rankings, as a complement to the official Test-Comp ranking. This alternative ranking category uses the energy consumption per score point as rank measure: CPU energy divided by quality, with the unit kilo-joule per score point (kJ/sp). The energy is measured using CPU Energy Meter [17], which we use as part of BenchExec [16].
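The rank measure itself is a simple quotient; a minimal sketch with hypothetical example numbers:

```python
def kj_per_score_point(energy_kwh: float, score: float) -> float:
    """Green-testing rank measure: CPU energy per score point,
    converted from kWh to kJ (1 kWh = 3600 kJ). Lower is better."""
    return energy_kwh * 3600 / score


# Hypothetical: a tool consumed 2.5 kWh and obtained 1800 score points
print(kj_per_score_point(2.5, 1800))  # -> 5.0 kJ/sp
```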

New Test Generators.
To acknowledge the test generators that participated for the first time in Test-Comp, the second alternative ranking category lists measures only for the new test generators, and the rank measure is the quality with the unit score point (sp). For example, CMA-ES Fuzz is an early prototype and has already obtained a total score of 411 points in category Cover-Branches, and FuSeBMC is a new tool based on some mature components and took second place already in its first participation. This should encourage developers of test generators to participate with new tools of any maturity level.

Conclusion
Test-Comp 2021 was the 3rd edition of the Competition on Software Testing, and attracted 11 participating teams (see Fig. 6 for the participation numbers and Table 4 for the details). The competition offers an overview of the state of the art in automatic software testing for C programs. The competition does not only execute the test generators and collect results, but also validates the achieved coverage of the test suites, based on the latest version of the test-suite validator TestCov. As before, the jury and the organizer made sure that the competition follows the high quality standards of the FASE conference, in particular with respect to the important principles of fairness, community support, and transparency.
Data Availability Statement. The test-generation tasks and results of the competition are published at Zenodo, as described in Table 3. All components and data that are necessary for reproducing the competition are available in public version repositories, as specified in Table 2. Furthermore, the results are presented online on the competition web site for easy access: https://test-comp.sosy-lab.org/2021/results/.
Errata: Table 8 of last year's report for Test-Comp 2020 contains a typo: The unit of the energy consumption per score point is kJ/sp (instead of J/sp).