First international competition on software testing

Tool competitions are a special form of comparative evaluation, where each tool has a team of developers or supporters associated that makes sure the tool is properly configured to show its best possible performance. In several research areas, tool competitions have been a driving force for the development of mature tools that represent the state of the art in their field. This paper describes and reports the results of the 1st International Competition on Software Testing (Test-Comp 2019), a comparative evaluation of automatic tools for software test generation. Test-Comp 2019 was presented as part of TOOLympics 2019, a satellite event of the conference TACAS. Nine test generators were evaluated on 2 356 test-generation tasks. There were two test specifications, one for generating a test that covers a particular function call and one for generating a test suite that tries to cover the branches of the program.


Introduction
Software testing is as old as software development itself, because the easiest way to find out whether software works is to test it. In the last few decades, the tremendous breakthrough of theorem provers and satisfiability-modulo-theory (SMT) solvers has led to the development of efficient tools for automatic test generation. For example, symbolic execution and the idea of using it for test generation [30] have existed for more than 40 years, but efficient implementations (e.g., Klee [16,17]) had to wait for the availability of mature constraint solvers. Also, with the advent of automatic software model checking, the opportunity to extract tests from counterexamples arose (see Blast [10] and JPF [36]). In the following years, many techniques from the areas of model checking and program analysis were adopted for the purpose of test generation, and several strong hybrid combinations have been developed [23].
While several powerful software test generators are available [23], they are very difficult to compare. For example, a recent study [12] first had to develop a framework that supports running test generators on the same program source code and delivering tests in a common format for validation. Furthermore, there is no widely distributed benchmark suite available and neither input programs nor output test suites follow a standard format. In software verification, the competition SV-COMP [5] helped to overcome similar problems: the competition community developed standards for defining nondeterministic functions and a language to write specifications (so far for C and Java programs) and established a standard exchange format for the output (witnesses). The competition also helped to adequately give credit to PhD students and PostDocs for their engineering efforts and technical contributions. A competition event with high visibility can foster the transfer of theoretical and conceptual advancements in software testing into practical tools and also gives credit and benefits to students who spend considerable amounts of time developing testing algorithms and software tools. Successful participation in competitions indicates qualification. Comparative overviews are helpful for engineers when selecting test tools for their purpose.
A preliminary version was published in Proc. TACAS 2019 [6].
Dirk Beyer, dirk.beyer@sosy-lab.org, LMU Munich, Oettingenstr. 67, 80538 Munich, Germany
Test-Comp is designed to compare automatic state-of-the-art software test generators with respect to effectiveness and efficiency. This comprises a preparation phase in which a set of benchmark programs is collected and classified (according to application domain, kind of bug to find, coverage criterion to fulfill, theories needed), in order to derive competition categories. After the preparation phase, the tools are submitted, installed, and run on the set of benchmark tasks.
Test-Comp uses the benchmarking framework BenchExec [14], which is already successfully used in other competitions, most prominently, all competitions that run on the StarExec infrastructure [35]. Similar to SV-COMP, the test generators in Test-Comp are applied to programs in a fully automatic way. The results are collected via the BenchExec results format and transformed into tables and plots in several formats.
Competition goals. In summary, the most important goals of the competition Test-Comp are the following:
• Establish a set of benchmarks for software testing in the community. This means to create and maintain a set of well-defined programs together with coverage criteria, and to make those publicly available for researchers to be used in performance comparisons when evaluating a new algorithm, technology, or implementation.
There are two competitions for automatic and off-site testing: Rode0day is a competition that is meant as a continuously running evaluation on bug-finding in binaries (currently Grep and SQLite). The unit-testing tool competition [29] is part of the SBST workshop and compares tools for unit-test generation on Java programs.
So far, there has been no comparative evaluation of automatic test generators in a controlled environment in which the tool developers were involved as participants and jury. Test-Comp [6] is meant to close this gap. The results of the first edition of Test-Comp were presented as part of the TOOLympics 2019 event [1], where 16 competitions in the area of formal methods were presented.

Organizational classification and schedule
The competition Test-Comp is designed according to the model of SV-COMP [2], the International Competition on Software Verification.
Classification. Test-Comp shares the following organizational principles:
• Automatic. The tools are executed in a fully automated environment, without any user interaction.
• Off-site. The competition takes place independently from a conference location, in order to flexibly allow problem solving and organizational changes.
• Reproducible. The experiments are controlled and reproducible, that is, the resources are limited, controlled, measured, and logged.
• Jury. The jury is the advisory board of the competition, is responsible for qualification decisions on tools and benchmarks, and serves as program committee for the reviewing and selection of papers to be published in conference proceedings or a journal. The jury ensures transparency of the competition organization and judges qualification of participants (but not their performance, which is computed using a scoring schema from the results, see Sect. 5). The jury is also responsible for new competition rules and deciding on new categories.
• Training. The competition flow includes a training phase during which the participants get a chance to train their tools on the potential benchmark instances and during which the organizer ensures a smooth competition execution, giving preliminary feedback to the participating teams.
[Figure: Test-Comp execution for one test generator; the left side depicts a test-generation run; the right side depicts a test-validation run]

Schedule. A typical Test-Comp schedule has the following deadlines and phases:
• Call for participation. The organizer announces the competition on the mailing list.
• Registration of participation and training phase. The tool developers register for participation and submit a first version of their tool together with documentation to the competition. The tool can later be updated and is used for pre-runs by the organizer and for qualification assessment by the jury. Preliminary results are reported to the tool developers and made available to the jury.
Execution of a test generator. Figure 1 illustrates the process of executing one test generator on one test-generation task. One test-generation run for a test generator gets as input (i) a program from the benchmark suite and (ii) a test specification (find bug, or coverage criterion), and returns as output a test suite (i.e., a set of tests). The test generator is contributed by the competition participant. The test-generation runs are executed centrally by the competition organizer. The test validator takes as input the test suite from the test generator and validates it by executing the program on all tests of the test suite: for bug finding it checks whether the bug is exposed and for coverage it reports the coverage using the GNU tool gcov.
Test specification. The specification for testing a program is given to the test generator as input file (either properties/coverage-error-call.prp or properties/coverage-branches.prp for Test-Comp). The definition init(main()) is used to define the entry of the program under test. The definition FQL(f) specifies that coverage definition f should be achieved. The FQL (FShell query language [25]) coverage definition COVER EDGES(@DECISIONEDGE) means that all branches should be covered, COVER EDGES(@BASIC BLOCKENTRY) means that all statements should be covered, and COVER EDGES(@CALL(__VERIFIER_error)) means that function __VERIFIER_error should be called. A complete specification looks like those in Table 1.

Table 1
Category | Formula | Interpretation
Cover-Error | COVER( init(main()), FQL(COVER EDGES(@CALL(__VERIFIER_error))) ) | The test suite contains at least one test that executes function __VERIFIER_error.
Cover-Branches | COVER( init(main()), FQL(COVER EDGES(@DECISIONEDGE)) ) | The test suite contains tests such that all branches of the program are executed.
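To make the structure of a complete specification concrete, the following Python sketch assembles the two Test-Comp 2019 specifications from the parts explained above (program entry plus FQL coverage definition) and maps a specification back to its category. The helper names `make_spec` and `category_of` are hypothetical and not part of any competition tooling, and the exact whitespace used in the real .prp files is an assumption.

```python
# Illustrative sketch only: the helper names are hypothetical, and the exact
# whitespace of the real .prp files is an assumption.

COVERAGE_DEFINITIONS = {
    "Cover-Error": "COVER EDGES(@CALL(__VERIFIER_error))",
    "Cover-Branches": "COVER EDGES(@DECISIONEDGE)",
}


def make_spec(category: str, entry: str = "main") -> str:
    """Build a complete specification: program entry plus FQL coverage definition."""
    fql = COVERAGE_DEFINITIONS[category]
    return f"COVER( init({entry}()), FQL({fql}) )"


def category_of(spec: str) -> str:
    """Return the category whose coverage definition occurs in the specification."""
    for category, fql in COVERAGE_DEFINITIONS.items():
        if f"FQL({fql})" in spec:
            return category
    raise ValueError(f"unknown coverage definition in: {spec!r}")


if __name__ == "__main__":
    for name in COVERAGE_DEFINITIONS:
        print(name, "->", make_spec(name))
```

A test generator that supports only one of the two specifications could use a check like `category_of` to reject tasks it cannot handle.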
License requirements for submitted test-generator archives. The test generators need to be publicly available for download as binary archive under a license that allows the following (cf. [5]):
• reproduction and evaluation by anybody (including results publication),
• no restriction on the usage of the verifier output (log files, witnesses), and
• any kind of (re-)distribution of the unmodified verifier archive.
Qualification. Before a tool or person can participate in the competition, the jury evaluates the following qualification criteria. Tool. A test tool is qualified to participate as competition candidate if the tool (a) is publicly available for download and fulfills the above license requirements, (b) works on the GNU/Linux platform (more specifically, it must run on an x86_64 machine), (c) is installable with user privileges (no root access required, except for required standard Ubuntu packages) and without hard-coded absolute paths for access to libraries and nonstandard external tools, (d) succeeds for more than 50 % of all training programs to parse the input and start the test process (a tool crash during the test-generation phase does not disqualify), and (e) produces test suites that adhere to the exchange format (see above).
Person. A person (participant) is qualified as competition contributor for a competition candidate if the person (a) is a contributing designer/developer of the submitted competition candidate (witnessed by occurrence of the person's name on the tool's project web page, a tool paper, or in the revision logs) or (b) is authorized by the competition organizer (after the designer/developer was contacted about the participation).

Benchmark programs and categories
The first edition of Test-Comp is based on programs written in the programming language C. The input programs are taken from the largest and most diverse open-source repository of software verification and test-generation tasks, which is also used by SV-COMP [5]. Selection. We selected all programs for which the following properties were satisfied (cf. issue on GitHub):
1. compiles with gcc, if a harness for the special input-providing nondeterministic methods is provided,
2. contains at least one call to such an input-providing nondeterministic function,
3. does not rely on nondeterministic pointers,
4. does not have expected verdict false for a 'termination' specification, and
5. has expected verdict false for an 'unreach-call' specification (only for category Cover-Error).
This selection yields a total of 2 356 test-generation tasks, namely 636 test-generation tasks for category Cover-Error and 1 720 test-generation tasks for category Cover-Branches. We now explain the above requirements in more detail: (1) It is necessary to be able to compile and link the program because we can execute a program only if all declared functions are implemented. (We need to execute the compiled programs on tests in order to measure coverage to evaluate the test suites produced by the test generators.) According to the specification of the benchmark repository, there are several unimplemented functions, which are meant to feed test inputs. Those functions have a name of the form __VERIFIER_nondet_X(), where X is a type from the set { bool, char, int, float, double, loff_t, long, pchar, pthread_t, sector_t, short, size_t, u32, uchar, uint, ulong, unsigned, and ushort } and the implementation can be assumed to return an arbitrary (nondeterministic) value of that type, without any side effects.
(2) The programs that we use for Test-Comp need to have at least one call of such a function that returns a nondeterministic value, in order to be able to identify the test inputs of the program and to later feed the test values to the program when executing it.
(3) The specification of the benchmark repository also knows special functions to return nondeterministic values for pointers (type void *), which are meant for verification based on model checking (used in SV-COMP). We do not use programs with such function calls in Test-Comp, because they often introduce undefined behavior. Those calls will be eliminated in the benchmark repository in the future, from 2020 onward, in order to avoid undefined behavior in verification and test-generation tasks.
(4) We exclude from Test-Comp all programs in the benchmark repository that have nonterminating executions. Those programs are meant for evaluating verification tools that detect nontermination. The verdict for the behavioral specification for termination is available in the task-definition files in the repository.
(5) For category Cover-Error, the task definition needs to contain the verdict false for the behavioral specification that a certain function call is not reachable. Otherwise, if the call is not reachable, then the task is not relevant for category Cover-Error of Test-Comp.
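The five selection criteria can be read as a simple filter over task descriptions. The following Python sketch is one way to express them under stated assumptions: the `Task` record and its fields are hypothetical, the name `__VERIFIER_nondet_pointer` for the pointer-returning function is an assumption, and the textual search is only a crude approximation because it cannot distinguish a call from a declaration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Types X for which functions __VERIFIER_nondet_X() are defined (from the text).
NONDET_TYPES = [
    "bool", "char", "int", "float", "double", "loff_t", "long", "pchar",
    "pthread_t", "sector_t", "short", "size_t", "u32", "uchar", "uint",
    "ulong", "unsigned", "ushort",
]
NONDET_CALL = re.compile(
    r"\b__VERIFIER_nondet_(" + "|".join(NONDET_TYPES) + r")\s*\("
)
# Assumption: name of the function that returns nondeterministic pointers.
NONDET_POINTER_CALL = re.compile(r"\b__VERIFIER_nondet_pointer\s*\(")


@dataclass
class Task:
    """Hypothetical task record; not the repository's task-definition format."""
    source: str                            # C source text of the program
    compiles_with_harness: bool            # criterion 1, checked externally with gcc
    termination_verdict: Optional[bool]    # None if no such specification exists
    unreach_call_verdict: Optional[bool]   # None if no such specification exists


def is_selected(task: Task, category: str) -> bool:
    """Apply selection criteria (1) to (5) for the given main category."""
    if not task.compiles_with_harness:                 # (1)
        return False
    if not NONDET_CALL.search(task.source):            # (2)
        return False
    if NONDET_POINTER_CALL.search(task.source):        # (3)
        return False
    if task.termination_verdict is False:              # (4)
        return False
    if category == "Cover-Error" and task.unreach_call_verdict is not False:  # (5)
        return False
    return True
```

Criterion (5) is the only one that depends on the category, which is why the same program can yield a Cover-Branches task without yielding a Cover-Error task.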

Categories. The test-generation tasks are partitioned into categories. Figure 2 illustrates the structure of the category composition. The results (in Tables 6 and 7) are listed according to the main categories. Category C-Overall consists of the two main categories Cover-Error and Cover-Branches (according to Table 1), which in turn consist of the following subcategories (same for both main categories in 2019): Arrays, BitVectors, ControlFlow, ECA, Floats, Heap, Loops, Recursive, and Sequentialized. The detailed definition of the categories (which test-generation tasks are contained in which subcategory) is available on the competition web site (see footnote 13). The main categories partition the test-generation tasks according to the test specification, that is, whether to generate a test suite for covering a single bug or to generate a test suite for covering as many branches as possible. The subcategories are structured based on the features of the programs that the test generators need to support: programs with arrays, with bit-vector arithmetic that cannot be approximated as linear arithmetic, with control flow that matters for the behavior, with a certain style of programming for event-condition-action (ECA) systems, with floating-point arithmetic, with data structures on the heap, with loops that are important to be analyzed, with recursive function calls, and programs that result from a transformation of multi-threaded programs to sequential programs.
The benchmark collection SV-Benchmarks contains benchmark sets of C programs (c/), Java programs (java/), and Horn clauses (clauses/). Test-Comp 2019 used only programs written in C. The C collection consists of many subdirectories, in order to structure the programs according to their provenance and features. Each directory usually contains a README file with a description of the contents and a LICENSE file (link) to declare the license of the programs. The subcategories are defined in category-definition files (.set). For example, the subcategory Arrays is defined by the file c/ReachSafety-Arrays.set. The above-mentioned web page (footnote 13) is generated from those category-definition files. The category-configuration files (.cfg) provide a short description of the subcategory and important information about the programs in the subcategory, most importantly, the bit architecture. For example, the category configuration for subcategory Arrays is contained in the file c/ReachSafety-Arrays.cfg.

Scoring schema
Every test-generation run will be executed in the execution environment of the competition according to the flow in Fig. 1, which produces for every test generator and every test-generation task (which is a pair of a C program and a test specification) a coverage value, which is a value in the interval [0, 1]. The coverage values are also called score points.
13 https://test-comp.sosy-lab.org/2019/benchmarks.php
Evaluation by scores and runtime. The participating test generators are ranked according to the cumulative coverage (sum of score points). Test generators with the same cumulative coverage are ranked according to success runtime. The success runtime for a test generator is the total CPU time over all test-generation runs for which the test generator successfully produced a test suite.
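The ranking rule can be sketched in a few lines of Python: sort by cumulative coverage in descending order and break ties by success runtime in ascending order. The `Result` record and the tool names in the usage example are hypothetical and the numbers are made up for illustration; they are not competition results.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    """Hypothetical per-tool summary; the values used below are invented."""
    tool: str
    score: float            # cumulative coverage (sum of score points)
    success_runtime: float  # total CPU time in seconds over successful runs


def rank(results: List[Result]) -> List[Result]:
    """Higher score first; equal scores are ordered by lower success runtime."""
    return sorted(results, key=lambda r: (-r.score, r.success_runtime))


if __name__ == "__main__":
    made_up = [
        Result("ToolA", 10.0, 500.0),
        Result("ToolB", 12.5, 900.0),
        Result("ToolC", 10.0, 300.0),
    ]
    for place, result in enumerate(rank(made_up), start=1):
        print(place, result.tool)
```

In this made-up example, ToolC is ranked ahead of ToolA because both have the same score and ToolC needed less CPU time.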
Cover-Error. The first category is to show the abilities to discover bugs. The benchmark set consists of programs that contain a bug. The coverage value is defined to be either 0 or 1, as follows:
$$\text{coverage value} = \begin{cases} 1 & \text{if the program under test is executed on a generated test that explores the bug (i.e., the specified function was called)} \\ 0 & \text{otherwise} \end{cases}$$
Cover-Branches. The second category is to cover as many branches as possible. The coverage criterion was chosen because many test generators support this standard criterion by default. Other coverage criteria can be reduced to branch coverage by transformation [24]. The coverage value is the branch coverage (as reported by gcov) that is achieved when the program under test is executed on all generated tests. Note that we measured what gcov calls branch coverage. In our experiments we discovered that what gcov reports is in fact not branch coverage, but measurement values that are closer to what is usually referred to as condition coverage. Therefore, the next Test-Comp uses measurements as reported by TestCov [13], which implements the usual definition of branch coverage.
Opt out. It is possible for participants to opt out from certain categories that are not supported by the test generator. In this case, the tables would show no result (empty table cell). In Test-Comp 2019, all teams participated in all categories. Esbmc did not support branch coverage and therefore the table displays a zero as result for category Cover-Branches (see Table 6).
Normalization of scores. Since the main categories are composed of subcategories, and the subcategories contain different numbers of test-generation tasks, there would be a bias toward subcategories with a large number of test-generation tasks. In other words, without normalization, it would maximize the score to work on categories that consist of many similar programs. However, we do not want to stipulate that one category is more important than another. Thus, we need to normalize the score, such that all subcategories have the same influence on the final result.
[Figure caption fragment: ... Fig. 1, the program under test and test specification are defined by the test-generation task (a), the test generator is taken from the test-generator archive (d), and the test suite is stored in an archive (f) for later evaluation by a test-validation run, which also works with the components depicted in the above figure]
The goal is to reduce the influence of a test-generation task in a large category compared to a test-generation task in a small category, and thus, balance over the categories. We use the normalization that is also used by SV-COMP (see competition report of SV-COMP 2013 [3], page 597): The score for a meta category is computed from the scores of all $k$ contained (sub-) categories using a normalization by the number of contained test-generation tasks: The normalized score $sn_i$ of a test generator in category $i$ is obtained by dividing the score $s_i$ by the number of tasks $n_i$ in category $i$ ($sn_i = s_i / n_i$), then the sum $\sum_{i=1}^{k} sn_i$ over the normalized scores of the categories is multiplied by the average number of tasks per category. An example calculation can be found on the web page of SV-COMP (https://sv-comp.sosy-lab.org/2019/rules.php#meta).
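The normalization can be written down directly from the description above. The following Python sketch is an illustration under the stated definition, not the competition's actual implementation; the function name and the example numbers are made up.

```python
from typing import List, Tuple


def meta_category_score(categories: List[Tuple[float, int]]) -> float:
    """Compute the normalized score of a meta category.

    Each entry is a pair (s_i, n_i): the score s_i of a test generator in
    subcategory i and the number n_i of test-generation tasks in it.
    """
    k = len(categories)
    if k == 0:
        raise ValueError("a meta category needs at least one subcategory")
    # Sum of the normalized scores sn_i = s_i / n_i.
    sum_normalized = sum(s / n for s, n in categories)
    # Average number of tasks per subcategory.
    average_tasks = sum(n for _, n in categories) / k
    return sum_normalized * average_tasks


if __name__ == "__main__":
    # Made-up example: a small category (8 of 10 points) and a large one (45 of 90).
    print(meta_category_score([(8.0, 10), (45.0, 90)]))
```

In the made-up example the plain sum of scores would be 53, while the normalized score is about 65, because the strong result in the small category weighs as much as the weaker result in the large one.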

Components for reproducibility
Reproducibility of the results is a main concern of a competition like Test-Comp. The competition must be as transparent and reproducible as possible. To achieve this goal, we duplicate the setup from SV-COMP [4] and describe here our adaptation to Test-Comp. We have to try to control all variables that might influence the results. Figure 3, in its top row, shows the input of the process of executing a test-generation run of the competition: (a) the test-generation task, (b) the benchmark definition, (c) the tool-info module, and (d) the test-generator archive. Using those four inputs, (e) the test-generation run produces (f) the resulting test suite in (g) the execution environment. Table 2 provides for each of the 7 components the repository URL and the tag to identify the precise version that was used in the competition. Table 3 lists the archives that were published on Zenodo.
Repository of test-generation tasks (a). The repository of test-generation tasks 11 is maintained by the community, using the GitHub issue tracker and pull requests to efficiently handle contributions from the contributors. The    [8].
The repository describes test-generation tasks using a task-definition file in YAML format, according to the standard: https://gitlab.com/sosy-lab/benchmarking/task-definition-format. For example, the task-definition file c/ntdrivers-simplified/cdaudio_simpl1.cil-1.yml refers to the C program (extracted from a device driver) c/ntdrivers-simplified/cdaudio_simpl1.cil-1.c and several test specifications, including c/properties/coverage-branches.prp, which results in a test-generation task that consists of this C program and the test specification to generate a test suite that covers all branches of the program.
Benchmark definitions (b). For executing test-generation runs, we need to set resource limits, and we need to know for each test generator, (i) which test-generation tasks need to be given to the test generator as input and (ii) which parameters need to be passed to the test generator (there are global, test-generator-specific parameters to be passed to the tool, and there is one task-specific parameter: the bit architecture). The benchmark definitions are XML files in the format that BenchExec expects; they are available in a repository. The execution of each test-generation run was limited to the resources specified in Table 4, for CPU time, RAM, and number of processing units (cores) of the CPU.
For example, the benchmark definition for CoVeriTest is shown in Fig. 4 (also available in the repository as benchmark-defs/coveritest.xml). This XML file describes first the tool-info module to be used (tool="cpachecker", see below under (c)), followed by a display name and the resource limits from Table 4. It also specifies the CPU model (cpuModel="Intel Xeon E3-1230 v5 @ 3.40 GHz") and that all 8 CPU cores shall be reserved for the test-generation run (cpuCores="8"). The rest of the file specifies the result files, the options for CoVeriTest, the properties, and the programs (compare with Fig. 2). A more detailed description is available in the BenchExec repository (doc/benchexec.md#defining-tasks-for-benchexec).
Tool-specific information (c). In order to correctly execute a test generator, we need to provide a tool-info module to BenchExec. The tool-info module assembles the command-line to properly invoke the test generator (including program-source and test-specification files as well as the parameters) from the parts specified in the benchmark definition (b). The tool-info modules that were used in Test-Comp 2019 are available in BenchExec release 1.18 [37].

Test-generator archives/test-validation archive (d).
The test generators are provided in an archive containing a license (that permits distribution, use in Test-Comp, and reproducing the results) and all parts that are needed to execute the test generator (statically linked executables, all components for which a certain version is required, or for which no standard Ubuntu package is available, are included). The test generators and the above-mentioned components are provided in the Test-Comp archives repository. The same holds for the test validator.
Precise controlling and measurement of resources (e). For scientifically valid experiments, we require for each test-generation run a reliable assignment and controlling of computing resources (cores, memory, CPU time), and a precise measurement. There are several requirements that experiments of a competition such as Test-Comp have to fulfill [14]: (i) accurate measurement and reliable enforcement of limits for CPU time and memory, (ii) reliable termination of processes (including all child processes), (iii) correct assignment of local memory (for NUMA architectures), and (iv) isolation of the test-generation run in a container. We used BenchExec [14] to perform all Test-Comp experiments, because this benchmarking framework lets us conveniently benefit from the modern resource-control and measurement mechanisms that the Linux kernel offers. All results, including raw measurement results, log files, and HTML files, are archived at Zenodo [7].
For example, the test suite that CoVeriTest generated for the above-mentioned test-generation task is directly accessible also on the Test-Comp web site (visit https://test-comp.sosy-lab.org/2019/results/results-verified/, click on the score (14) for column CoVeriTest and row coverage-branches.ReachSafety-ControlFlow, then in the table with the detailed results click on the cell for column test-suite and row ntdrivers-simplified/cdaudio_simpl1.cil-1.yml, to obtain the file 1bbef0df...zip). The test suite is contained in a directory test-suite/ inside the ZIP archive. The directory contains a file metadata.xml that describes the test suite and one file test...xml for each test (also called test vector). The meta-data file is shown in Fig. 5; it provides information about the language of the program, the producing engine (CoVeriTest is based on CPAchecker), the test specification, the program path, the SHA-256 hash of the program, the entry function, the data model of the CPU for which the program was written, and the creation time stamp. The first test of the test suite is shown in Fig. 6; it provides a sequence of test values to be fed into the program.
Fig. 7 Coverage plot for the discussed test suite from Test-Comp 2019 (test suite 1bbef0df...zip); the diagram shows the number of tests processed on the x-axis and the coverage in percent on the y-axis, that is, a data point (100, 60) informs us that the first 100 tests cover 60 % of the program's branches
The discussed test suite contains 212 tests. During the test-validation run, the test validator takes the test suite and executes the program on each test (feeding in the values from the XML file). For the discussed test suite, the test generator is assigned a score of 0.738, because the test suite covers 73.8 % of all branches of the program. The increase in coverage by each test is illustrated in Fig. 7.
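The data behind a coverage plot of this kind can be derived by accumulating the branches covered by each test in order. The Python sketch below assumes that the set of branches covered by each individual test is already known; the branch identifiers are hypothetical, and the sketch does not reproduce how the test validator obtains its values from gcov.

```python
from typing import Iterable, List, Set, Tuple


def cumulative_coverage(
    covered_by_test: Iterable[Set[int]], total_branches: int
) -> List[Tuple[int, float]]:
    """Return data points (number of tests processed, coverage in percent)."""
    if total_branches <= 0:
        raise ValueError("total_branches must be positive")
    seen: Set[int] = set()
    points = []
    for count, branches in enumerate(covered_by_test, start=1):
        seen |= set(branches)  # a branch counts once, no matter how often it is hit
        points.append((count, 100.0 * len(seen) / total_branches))
    return points


if __name__ == "__main__":
    # Made-up example: four tests on a program with ten branches.
    tests = [{1, 2}, {2, 3}, {3}, {4, 5, 6}]
    for point in cumulative_coverage(tests, total_branches=10):
        print(point)
```

The last data point divided by 100 corresponds to the score for the task; the third made-up test adds nothing, which shows up as a flat segment in the plot.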
Execution environment (g) The machines for running the experiments were part of a compute cluster at LMU Munich that consists of 168 machines; each test-generation run was executed on an otherwise completely unloaded, dedicated machine, in order to achieve precise measurements. Each machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each, a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system (x86_64-linux, Ubuntu 18.04 with Linux kernel 4.15). Further technical parameters of the competition machines are available in the file README.md of the repository that also contains the

Results
For the first time, the competition Test-Comp 2019 presents the state of the art in fully automatic test-generation for whole C programs, using a developer-involved comparative evaluation based on controlled experiments. The results help in understanding the current achievements of the test-generation research, in terms of effectiveness (test coverage, as accumulated in the score) and efficiency (resource consumption in terms of CPU time). All results mentioned in this article were inspected and approved by the participants.
Participating tools. The automatic test generators that participated in the first edition of Test-Comp are listed in Table 5. The table provides for each of the 9 participating systems the test-generator name (links to the project web site in the PDF version of this article), references to system descriptions, and the representing jury member (with affiliation).
Quantitative results. Table 6 presents the quantitative overview of all tools and all categories. The head row mentions the category and the number of test-generation tasks in that category. The tools are listed in alphabetical order; every table row lists the scores of one test generator. We indicate the top three candidates by formatting their scores in bold face and in larger font size. More information (including interactive tables, quantile plots for every category, and also the raw data in XML format) is available on the competition web site and in the results artifacts (see Table 3).
Score-based quantile functions. We use score-based quantile functions (see [14], Sect. 7.7 and 7.8, and [4], pages 899-900) for quality assessment, because these visualizations make it easier to understand the results of the comparative evaluation. The web site and the results artifact (Table 3) include such a plot for each category. As an example, we show the plot for category C-Overall (all test-generation tasks) in Fig. 8. All 9 test generators participated in category C-Overall, for which the quantile plot shows the overall performance over all categories (scores for meta categories are normalized, see Sect. 5).
Generation of score-based quantile plots. For the score calculation, we have computed a function that maps each test-generation task to the coverage (error coverage or branch coverage) of the test suite that was generated for this test-generation task and another function to map each test-generation task to the achieved normalized coverage score. Now, we sort the pairs (test-generation task, achieved normalized score) by the score in descending order, and accumulate the score and test-generation tasks. The pairs (cumulative score, number of test-generation tasks) define the quantile function, which maps an achieved normalized score to the minimal number of test-generation tasks that are needed to achieve this coverage score with the given test suite. (Note that quantile plots compare quantiles and not individual test-generation tasks, that is, one cannot tell from a quantile plot the performance on a certain single test-generation task.) Plots of functions like this can easily be generated using tools like gnuplot.
Interpretation of data points. A data point (x, y) for a test generator tells us that the tool needed y test-generation tasks to achieve the score of x score points (cumulative, normalized coverage values). For example, the quantile function for VeriFuzz contains the pair (1802.9, 1000), which means that VeriFuzz needs at least 1000 test-generation tasks to achieve a coverage of 1802.9 score points. Such plots make it easy to compare the performance of different test generators, because the graphs are monotonically increasing. The lower and the more to the right a graph is drawn, the better is the test generator.
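The construction of the data points can be sketched as follows in Python. This is a minimal illustration of the sort-and-accumulate step described above, with a hypothetical function name and made-up scores; it is not the script that produced the plots of the competition.

```python
from typing import Iterable, List, Tuple


def quantile_points(normalized_scores: Iterable[float]) -> List[Tuple[float, int]]:
    """Return pairs (cumulative score, number of test-generation tasks).

    The input holds one normalized score per test-generation task. Sorting in
    descending order ensures that each cumulative score is reached with the
    minimal number of tasks.
    """
    points = []
    cumulative = 0.0
    ordered = sorted(normalized_scores, reverse=True)
    for number_of_tasks, score in enumerate(ordered, start=1):
        cumulative += score
        points.append((cumulative, number_of_tasks))
    return points


if __name__ == "__main__":
    # Made-up normalized scores for four tasks.
    for x, y in quantile_points([0.5, 1.0, 0.0, 0.25]):
        print(x, y)  # two columns that a plotting tool such as gnuplot can read
```

The x-coordinate of the last pair is the total score of the test generator, which is why the right end of each graph matches the score in the results table.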
Overall quality measured in scores (right end of graph). VeriFuzz is the winner of this category: the x-coordinate of the right-most data point represents the highest total score (and thus, the total value) of the completed test-generation work (cf. Tables 6 and 7; right-most x-coordinates match the score values in the tables). The ranking can be read from the plot of the quantile functions from right to left: The right-most data point for VeriFuzz is 1 951, for Klee 1 764, for CoVeriTest 1 524, and so on.

Consumed resources. One complete test-generation execution of the competition consisted of 21 204 single test-generation runs (see Table 8). The total CPU time was 122 days and the consumed energy 32.1 kWh for one complete competition run for test generation (without validation). Test-suite validation consisted of 21 204 single test-suite validation runs. The total consumed CPU time was 31.1 days. Each tool was executed several times, in order to make sure no installation issues occur during the execution, and thus, the total consumed resources including pre-runs were a multiple of the above-mentioned amounts of resources.
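To put the resource totals above into perspective, the following back-of-the-envelope sketch derives per-run averages. The averages are not reported in the competition results; they are computed here only from the totals stated in the text, under the assumption that the 21 204 runs correspond to each of the 9 test generators being executed on each of the 2 356 test-generation tasks. All variable names are illustrative.

```python
# Totals as stated in the text.
TOOLS = 9
TASKS = 2356
RUNS = 21204
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption of this sketch: one run per (tool, task) pair.
runs_match = TOOLS * TASKS == RUNS

# Derived averages per single run (not reported values).
avg_generation_cpu_s = 122 * SECONDS_PER_DAY / RUNS   # test generation
avg_validation_cpu_s = 31.1 * SECONDS_PER_DAY / RUNS  # test-suite validation
avg_energy_wh = 32.1 * 1000 / RUNS                    # kWh converted to Wh

print(runs_match)
print(round(avg_generation_cpu_s), "s CPU time per generation run")
print(round(avg_validation_cpu_s), "s CPU time per validation run")
print(round(avg_energy_wh, 1), "Wh per generation run")
```

Under these assumptions, an average test-generation run consumed roughly 497 s of CPU time and about 1.5 Wh of energy, and an average validation run roughly 127 s of CPU time.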

Conclusion and future plans
Test-Comp 2019 gave an overview of the state of the art in automatic test generation for C programs. This report describes the organizational aspects of the 1st International Competition on Software Testing (Test-Comp 2019), and the qualitative and quantitative results of the comparative evaluation. The competition attracted nine participating teams from six countries. The feedback from the testing community was positive, and the plan is to hold the competition on software testing annually from now on. We hope that the introduced standards for marking input values, specifying the test-coverage criteria, and writing the generated test suites encourage developers of test generators to apply those standards, in order to deliver tools that are easy to compare and to use as components in quality assurance. For the future, the community has plans to increase the size and diversity of the benchmark set, experiment with different time budgets, reduce the size of the generated test suites, include test-generation tasks for Java programs, and extend the categories toward other coverage criteria (e.g., MC/DC) and mutation testing.

Data Availability Statement
The test-generation tasks and results of the competition are published at Zenodo, as described in Table 3. All components and data that are necessary for reproducing the competition are available in public version repositories, as specified in Table 2. Furthermore, the results are presented online on the competition web site for easy access: https://test-comp.sosy-lab.org/2019/results

Funding. This work was funded in part by the Deutsche Forschungsgemeinschaft (DFG) - 418257054 (Coop). Open Access was funded by Projekt DEAL.
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.