
1 Introduction

In its 5th edition, the International Competition on Software Testing (Test-Comp, https://test-comp.sosy-lab.org, [7,8,9,10,11]) again compares automatic test-suite generators for C programs, in order to showcase the state of the art in automatic software testing. This competition report updates the previous reports: it refers to the rules and definitions, presents the competition results, and gives some interesting data about the execution of the competition experiments. We used BenchExec [24] to execute the benchmarks; the results are presented in tables and graphs on the competition web site (https://test-comp.sosy-lab.org/2023/results) and are available in the accompanying archives (see Table 3).

Competition Goals. In summary, the goals of Test-Comp are the following [8]:

  • Establish standards for software test generation. This means, most prominently, to develop a standard for marking input values in programs, define an exchange format for test suites, agree on a specification language for test-coverage criteria, and define how to validate the resulting test suites.

  • Establish a set of benchmarks for software testing in the community. This means to create and maintain a set of programs together with coverage criteria, and to make those publicly available for researchers to be used in performance comparisons when evaluating a new technique.

  • Provide the community with an overview of available tools for test-case generation and a snapshot of the state of the art in software testing. This means to compare, independently of particular paper projects and specific techniques, different test generators in terms of effectiveness and performance.

  • Increase the visibility and credits that tool developers receive. This means to provide a forum for presentation of tools and discussion of the latest technologies, and to give the participants the opportunity to publish about the development work that they have done.

  • Educate PhD students and other participants on how to set up performance experiments, how to package tools in a way that supports reproduction, and how to perform robust and accurate research experiments.

  • Provide resources to development teams that do not have sufficient computing resources and give them the opportunity to obtain results from experiments on large benchmark sets.

Related Competitions. In the field of formal methods, competitions are respected as an important evaluation method, and there are many of them [5]. We refer to the report from Test-Comp 2020 [8] for a more detailed discussion and give here only the references to the most related competitions [5, 13, 46, 48].

2 Definitions, Formats, and Rules

Organizational aspects such as the classification (automatic, off-site, reproducible, jury, training) and the competition schedule are given in the initial competition definition [7]. In the following, we repeat some important definitions that are necessary to understand the results.

Test-Generation Task. A test-generation task is a pair of an input program (program under test) and a test specification. A test-generation run is a non-interactive execution of a test generator on a single test-generation task, in order to generate a test suite according to the test specification. A test suite is a sequence of test cases, given as a directory of files according to the format for exchangeable test suites.Footnote 1
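
As an illustration of that format, each test case is a small XML file that lists the input values in the order in which the program under test requests them; the following is a minimal sketch (input values made up, format version may differ):

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <!DOCTYPE testcase PUBLIC "+//IDN sosy-lab.org//DTD test-format testcase 1.1//EN"
        "https://sosy-lab.org/test-format/testcase-1.1.dtd">
    <!-- hypothetical test case (illustrative only): each <input> feeds one requested value -->
    <testcase>
      <input>42</input>
      <input>0</input>
    </testcase>

The test-suite directory additionally contains a metadata.xml file with information about the producer and the test-generation task.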

Fig. 1. Flow of the Test-Comp execution for one test generator (taken from [8])

Execution of a Test Generator. Figure 1 illustrates the process of executing one test-suite generator on the benchmark suite. One test run for a test-suite generator gets as input (i) a program from the benchmark suite and (ii) a test specification (cover a bug, or cover branches), and returns as output a test suite (i.e., a set of test cases). The test generator is contributed by a competition participant as a software archive in ZIP format. The test runs are executed centrally by the competition organizer. The test-suite validator takes the test suite from the test generator as input and validates it by executing the program on all test cases: for bug finding, it checks whether the bug is exposed, and for coverage, it reports the achieved coverage. We use the tool TestCov [23]Footnote 2 as test-suite validator.

Test Specification. The specification for testing a program is given to the test generator as an input file (either properties/coverage-error-call.prp or properties/coverage-branches.prp for Test-Comp 2023).

The definition init(main()) is used to define the initial states of the program under test by a call of function main (with no parameters). The definition FQL(f) specifies that coverage definition f should be achieved. The FQL (FShell query language [36]) coverage definition COVER EDGES(@DECISIONEDGE) means that all branches should be covered (typically used to obtain a standard test suite for quality assurance) and COVER EDGES(@CALL(foo)) means that a call (at least one) to function foo should be covered (typically used for bug finding). A complete specification looks like: COVER(init(main()), FQL(COVER EDGES(@DECISIONEDGE))).
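
Concretely, the two specification files contain coverage specifications of the following shape (a sketch based on the benchmark repository; the error specification targets the bug-marking function reach_error):

    properties/coverage-branches.prp:
      COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) )
    properties/coverage-error-call.prp:
      COVER( init(main()), FQL(COVER EDGES(@CALL(reach_error))) )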

Table 1 lists the two FQL formulas that are used in test specifications of Test-Comp 2023; there was no change from 2020 (except that special function __VERIFIER_error does not exist anymore).
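
As an illustration (a hypothetical program, not taken from the benchmark set), the following C fragment shows how test inputs and the error function typically appear in a test-generation task; the function names follow the conventions of the benchmark repository:

    // example.c: hypothetical program under test (illustrative only)
    extern int __VERIFIER_nondet_int(void);  // marks a nondeterministic test input
    extern void reach_error(void);           // target of COVER EDGES(@CALL(reach_error))

    int main(void) {
      int x = __VERIFIER_nondet_int();       // value supplied by a test case
      if (x > 10) {
        reach_error();                       // a test case with x > 10 exposes the bug
      }
      return 0;
    }

For the branch-coverage specification, a test suite with the inputs 0 and 11, for example, covers both branches of the condition.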

Table 1. Coverage specifications used in Test-Comp 2023 (similar to 2019–2022)

Task-Definition Format 2.0. Test-Comp 2023 again used the task-definition format in version 2.0.
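
For illustration, a task definition in version 2.0 of the format is a YAML file that references the program under test and the applicable test specifications; the following is a minimal sketch (file names made up, key names as we understand the format):

    # example.yml: hypothetical task definition (illustrative only)
    format_version: '2.0'
    input_files: 'example.c'
    properties:
      - property_file: ../properties/coverage-error-call.prp
      - property_file: ../properties/coverage-branches.prp
    options:
      language: C
      data_model: ILP32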

License and Qualification. The license of each participating test generator must allow its free use for reproduction of the competition results. Details on qualification criteria can be found in the competition report of Test-Comp 2019 [9].

3 Categories and Scoring Schema

Benchmark Programs. The input programs were taken from the largest and most diverse open-source repository of software-verification and test-generation tasksFootnote 3, which is also used by SV-COMP [13]. As in 2020 and 2021, we selected all programs for which the following properties were satisfied (see issue on GitLabFootnote 4 and report [9]):

  1. compiles with gcc, if a harness for the special methodsFootnote 5 is provided (a minimal harness sketch is given after this list),

  2. should contain at least one call to a nondeterministic function,

  3. does not rely on nondeterministic pointers,

  4. does not have expected result ‘false’ for property ‘termination’, and

  5. has expected result ‘false’ for property ‘unreach-call’ (only for category Error Coverage).
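
A harness for the special methods (criterion 1) only needs to provide definitions of the __VERIFIER_nondet_* functions so that the program compiles and test inputs can be fed in; the following is a minimal sketch (reading input values line by line from stdin is an assumption for illustration, not a format prescribed by the competition):

    // harness.c: minimal sketch of a test harness (illustrative only)
    #include <stdio.h>
    #include <stdlib.h>

    int __VERIFIER_nondet_int(void) {
      // Consume one input value of the current test case, here read from stdin.
      char line[64];
      if (fgets(line, sizeof line, stdin) == NULL) {
        exit(0);  // no further inputs: stop this test execution
      }
      return (int) strtol(line, NULL, 10);
    }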

This selection yielded a total of 4 106 test-generation tasks, namely 1 173 tasks for category Error Coverage and 2 933 tasks for category Code Coverage. The test-generation tasks are partitioned into categories, which are listed in Tables 6 and 7 and described in detail on the competition web site.Footnote 6 Figure 2 illustrates the category composition.

Fig. 2. Category structure for Test-Comp 2023; compared to Test-Comp 2022, sub-category Hardware was added to main category Cover-Error

Category Error-Coverage. The first category evaluates the ability to discover bugs. The benchmark set consists of programs that each contain a bug. For every tool and every test-generation task, we assign one of the following scores: 1 point, if the validator succeeds in executing the program under test on a generated test case that exposes the bug (i.e., the specified function was called), and 0 points, otherwise.

Category Branch-Coverage. The goal in the second category is to cover as many branches of the program as possible. This coverage criterion was chosen because many test generators support it by default. Other coverage criteria can be reduced to branch coverage by transformation [35]. For every tool and every test-generation task, we take the branch coverage that the generated test cases achieve when executed (as reported by TestCov [23]; a value between 0 and 1). The score is the reported coverage.
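
As an illustration with made-up numbers: if TestCov reports that the generated test suite executes 147 of the 200 branches of a program, the branch coverage is 147/200 = 0.735, and the task contributes 0.735 points to the score.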

Ranking. The ranking was decided based on the sum of points (normalized for meta categories). In case of a tie, the ranking was decided based on the run time, which is the total CPU time over all test-generation tasks. Opt-out from categories was possible and scores for categories were normalized based on the number of tasks per category (see competition report of SV-COMP 2013 [6], page 597).

4 Reproducibility

We followed the same competition workflow that was described in detail in the previous competition report (see Sect. 4, [10]). All major components that were used for the competition were made available in public version-control repositories. An overview of the components that contribute to the reproducible setup of Test-Comp is provided in Fig. 3, and the details are given in Table 2. We refer to the report of Test-Comp 2019 [9] for a thorough description of all components of the Test-Comp organization and how we ensure that all parts are publicly available for maximal reproducibility.

In order to guarantee long-term availability and immutability of the test-generation tasks, the produced competition results, and the produced test suites, we also packaged the material and published it at Zenodo (see Table 3).

Fig. 3. Benchmarking components of Test-Comp and competition’s execution flow (same as for Test-Comp 2020)

Table 2. Publicly available components for reproducing Test-Comp 2023
Table 3. Artifacts published for Test-Comp 2023

The competition used CoVeriTeam [20]Footnote 7 again to provide participants access to execution machines that are similar to the actual competition machines. The competition report of SV-COMP 2022 provides a description of how to reproduce individual results and how to troubleshoot (see Sect. 3, [12]).

Table 4. Competition candidates with tool references and representing jury members; first-time participants and hors-concours participations are marked with symbols

5 Results and Discussion

This section presents the results of the competition experiments. The report shall help to understand the state of the art and the advances in fully automatic test generation for whole C programs, in terms of effectiveness (test coverage, as accumulated in the score) and efficiency (resource consumption in terms of CPU time). All results mentioned in this article were inspected and approved by the participants.

Table 5. Technologies and features that the test generators used

Participating Test-Suite Generators. Table 4 provides an overview of the participating test generators and references to publications, as well as the team representatives of the jury of Test-Comp 2023. (The competition jury consists of the chair and one member of each participating team.) An online table with information about all participating systems is provided on the competition web site.Footnote 8 Table 5 lists the features and technologies that are used in the test generators.

There are test generators that did not actively participate (e.g., tester archives were taken from last year) and that are not included in the rankings. Those are called hors-concours participations, and the respective tool names are marked with a symbol in the tables.

Computing Resources. The computing environment and the resource limits were the same as for Test-Comp 2020 [8], except for the upgraded operating system: Each test run was limited to 8 processing units (cores), 15 GB of memory, and 15 min of CPU time. The test-suite validation was limited to 2 processing units, 7 GB of memory, and 5 min of CPU time. The machines for running the experiments are part of a compute cluster that consists of 168 machines; each test-generation run was executed on an otherwise completely unloaded, dedicated machine, in order to achieve precise measurements. Each machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86_64-linux, Ubuntu 22.04 with Linux kernel 5.15). We used BenchExec [24] to measure and control computing resources (CPU time, memory, CPU energy) and VerifierCloudFootnote 9 to distribute, install, run, and clean up test-case generation runs, and to collect the results. The values for time and energy are accumulated over all cores of the CPU. To measure the CPU energy, we use CPU Energy Meter [25] (integrated in BenchExec [24]). Further technical parameters of the competition machines are available in the repository that also contains the benchmark definitions.Footnote 10
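
For orientation, resource limits like the ones above are specified in a BenchExec benchmark definition; the following is an illustrative sketch only (tool and set names are made up; the actual benchmark definitions are in the repository mentioned above), with element and attribute names as we understand the BenchExec format:

    <?xml version="1.0"?>
    <!-- hypothetical benchmark definition (illustrative only) -->
    <benchmark tool="mytester" timelimit="15 min" memlimit="15 GB" cpuCores="8">
      <rundefinition name="test-comp23_coverage-branches">
        <tasks name="Cover-Branches">
          <includesfile>example-tasks.set</includesfile>
          <propertyfile>properties/coverage-branches.prp</propertyfile>
        </tasks>
      </rundefinition>
    </benchmark>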

One complete test-generation execution of the competition consisted of 50 445 single test-generation runs in 25 run sets (tester × property). The total CPU time was 315 days and the consumed energy 89.9 kWh for one complete competition run for test generation (without validation). Test-suite validation consisted of 53 378 single test-suite validation runs in 26 run sets (validator × property). The total consumed CPU time was 19 days. Each tool was executed several times, in order to make sure no installation issues occur during the execution. Including preruns, the infrastructure managed a total of 254 445 test-generation runs (consuming 3.0 years of CPU time). The prerun test-suite validation consisted of 338 710 single test-suite validation runs in 152 run sets (validator × property) (consuming 63 days of CPU time). The CPU energy was not measured during preruns.

Table 6. Quantitative overview of all results; empty cells mark opt-outs; first-time participants and hors-concours participations are marked with symbols
Table 7. Overview of the top-three test generators for each category (measurement values for CPU time and energy rounded to two significant digits)
Table 8. New test-suite generators in Test-Comp 2022 and Test-Comp 2023; column ‘Sub-categories’ gives the number of executed categories

New Test-Suite Generators. To acknowledge the test-suite generators that joined the competition recently, we list the tools that participated for the first time: Table 8 names the test generators that were new in Test-Comp 2023, together with Legion/SymCC, which participated first in Test-Comp 2022. Table 8 also reports the number of sub-categories in which the tools participated.

Fig. 4. Number of evaluated test generators for each year (top: number of first-time participants; bottom: previous year’s participants)

Quantitative Results. The quantitative results are presented in the same way as last year: Table 6 presents the quantitative overview of all tools and all categories. The head row mentions the category and the number of test-generation tasks in that category. The tools are listed in alphabetical order; every table row lists the scores of one test generator. We indicate the top three candidates by formatting their scores in bold face and in larger font size. An empty table cell means that the test generator opted-out from the respective main category (perhaps participating in subcategories only, restricting the evaluation to a specific topic). More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web siteFootnote 11 and in the results artifact (see Table 3). Table 7 reports the top three test generators for each category. The consumed run time (column ‘CPU Time’) is given in hours and the consumed energy (column ‘Energy’) is given in kWh.

Fig. 5. Quantile functions for category Overall. Each quantile function illustrates the quantile (x-coordinate) of the scores obtained by test-generation runs below a certain number of test-generation tasks (y-coordinate). More details were given previously [9]. The graphs are decorated with symbols to make them better distinguishable without color.

Score-Based Quantile Functions for Quality Assessment. We use score-based quantile functions [24] because these visualizations make it easier to understand the results of the comparative evaluation. The web site (see Footnote 11) and the results artifact (Table 3) include such a plot for each category; as an example, we show the plot for category Overall (all test-generation tasks) in Fig. 5. We had 11 test generators participating in category Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized [6]). A more detailed discussion of score-based quantile plots for testing is provided in the Test-Comp 2019 competition report [9].

6 Conclusion

The Competition on Software Testing took place for the 5th time and provides an overview of fully automatic test-generation tools for C programs. A total of 13 test-suite generators were compared (see Fig. 4 for the participation numbers and Table 4 for the details). This off-site competition uses a benchmark infrastructure that makes the execution of the experiments fully automatic and reproducible. Transparency is ensured by making all components available in public repositories and by having a jury (consisting of members from each team) that oversees the process. All test suites were validated by the test-suite validator TestCov [23] to measure the coverage. The results of the competition are presented at the 26th International Conference on Fundamental Approaches to Software Engineering at ETAPS 2023.