An Overview of Competitions in Formal Methods



Introduction
Over the last years, our society's dependency on digital systems has been steadily increasing. At the same time, the complexity of such systems is continuously growing, which increases the chances of such systems behaving unreliably, with many undesired consequences. In order to master this complexity, and to guarantee that digital systems behave as desired, software tools are designed that can be used to analyze and verify the behavior of digital systems. These tools are becoming more prominent, in academia as well as in industry. The range of these tools is enormous, and trying to understand which tool to use for which system is a major challenge. In order to get a better grip on this problem, many different competitions and challenges have been created, aiming in particular at better understanding the actual profile of the different tools that reason about systems in a given application domain.
The first competitions started in the 1990s (e.g., SAT and CASC). Since the year 2000, the number of competitions has been steadily increasing, and currently there is a wide range of different verification competitions. We believe there are several reasons for this increase in the number of competitions in the area of formal methods:
• increased computing power makes it feasible to apply tools to large benchmark sets,
• tools are becoming more mature,
• growing interest in the community to show practical applicability of theoretical results, in order to stimulate technology transfer,
• growing awareness that reproducibility and comparative evaluation of results is important, and
• organization and participation in verification competitions is a good way to get scientific recognition for tool development.
We notice that despite the many differences between the different competitions and challenges, there are also many similar concerns, in particular from an organizational point of view:
• How to assess adequacy of benchmark sets, and how to establish suitable input formats? And what is a suitable license for a benchmark collection?
• How to execute the challenges (on-site vs. off-site, on controlled resources vs. on individual hardware, automatic vs. interactive, etc.)?
• How to evaluate the results, e.g., in order to obtain a ranking?
• How to ensure fairness in the evaluation, e.g., how to avoid bias in the benchmark sets, how to reliably measure execution times, and how to handle incorrect or incomplete results?
• How to guarantee reproducibility of the results?
• How to achieve and measure progress of the state of the art?
• How to make the results and competing tools available so that they can be leveraged in subsequent events?
Therefore, as part of the celebration of 25 years of TACAS we organized TOOLympics, as an occasion to bring together researchers involved in competition organization. It is a goal of TOOLympics to discuss similarities and differences between the participating competitions, to facilitate cross-community communication to exchange experiences, and to discuss possible cooperation concerning benchmark libraries, competition infrastructures, publication formats, etc. We hope that the organization of TOOLympics will put forward the best practices to support competitions and challenges as useful and successful events.
In the remainder of this paper, we give an overview of all competitions participating in TOOLympics, as well as an outlook on the future of competitions. Table 1 provides references to other papers (also in this volume) providing additional perspective, context, and details about the various competitions. There are more competitions in the field, e.g., ARCH-COMP [1], ICLP Comp, MaxSAT Evaluation, Reactive Synthesis Competition [57], QBFGallery [73], and SyGuS-Competition.

Overview of all Participating Competitions
A competition is an event that is dedicated to fair comparative evaluation of a set of participating contributions at a given time. This section shows that such participating contributions can be of different forms: tools, result compilations, counterexamples, proofs, reasoning approaches, solutions to a problem, etc.
Table 1 categorizes the TOOLympics competitions. The first column names the competition (and the digital version of this article provides a link to the competition web site). The second column states the year of the first edition of the competition, and the third column the number of editions of the competition. The next two columns characterize the way the participating contributions are evaluated: most of the competitions evaluate automated tools that do not require user interaction, and the experiments are executed by benchmarking environments, such as BenchExec [29], BenchKit [69], or StarExec [92]. However, some competitions require a manual evaluation, due to the nature of the competition and its evaluation criteria. The next two columns show where and when the results of the competition are determined: on-site during the event or off-site before the event takes place. Finally, the last column provides references for the reader to look up more details about each of the competitions.
The remainder of this section introduces the various competitions of TOOLympics 2019.

CASC: The CADE ATP System Competition
Organizer: Geoff Sutcliffe (Univ. of Miami, USA) Webpage: http://www.tptp.org The CADE ATP System Competition (CASC) [107] is held at each CADE and IJCAR conference. CASC evaluates the performance of sound, fully automatic, classical logic Automated Theorem Proving (ATP) systems. The evaluation is in terms of: the number of problems solved, the number of problems solved with a solution output, and the average runtime for problems solved; in the context of: a bounded number of eligible problems, chosen from the TPTP Problem Library, and specified time limits on solution attempts. CASC is the longest running of the various logic solver competitions, with the 25th event to be held in 2020. This longevity has allowed the design of CASC to evolve into a sophisticated and stable state. Each year's experiences lead to ideas for changes and improvements, so that CASC remains a vibrant competition. CASC provides an effective public evaluation of the relative capabilities of ATP systems. Additionally, the organization of CASC is designed to stimulate ATP research, motivate development and implementation of robust ATP systems that are useful and easily deployed in applications, provide an inspiring environment for personal interaction between ATP researchers, and expose ATP systems within and beyond the ATP community.

CHC-COMP: Competition on Constrained Horn Clauses
Organizers: Grigory Fedyukovich (Princeton Univ., USA), Arie Gurfinkel (Univ. of Waterloo, Canada), and Philipp Rümmer (Uppsala Univ., Sweden) Webpage: https://chc-comp.github.io/ Constrained Horn Clauses (CHC) is a fragment of First Order Logic (FOL) that is sufficiently expressive to describe many verification, inference, and synthesis problems including inductive invariant inference, model checking of safety properties, inference of procedure summaries, regression verification, and sequential equivalence. The CHC competition (CHC-COMP) compares state-of-the-art tools for CHC solving with respect to performance and effectiveness on a set of publicly available benchmarks. The winners among participating solvers are recognized by measuring the number of correctly solved benchmarks as well as the runtime. The results of CHC-COMP 2019 will be announced in the HCVS workshop affiliated with ETAPS.

CoCo: Confluence Competition
Organizers: Aart Middeldorp (Univ. of Innsbruck, Austria), Julian Nagele (Queen Mary Univ. of London, UK), and Kiraku Shintani (JAIST, Japan) Webpage: http://project-coco.uibk.ac.at/ The Confluence Competition (CoCo) has existed since 2012. It is an annual competition of software tools that aim to (dis)prove confluence and related (undecidable) properties of a variety of rewrite formalisms automatically. CoCo runs live in a single slot at a conference or workshop and is executed on the cross-community competition platform StarExec. For each category, 100 suitable problems are randomly selected from the online database of confluence problems (COPS). Participating tools must answer YES or NO within 60 s, followed by a justification that is understandable by a human expert; any other output signals that the tool could not determine the status of the problem. CoCo 2019 features new categories on commutation, confluence of string rewrite systems, and infeasibility problems.
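The selection and answer protocol described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by CoCo or StarExec: the function names `select_problems` and `classify_answer`, the label `MAYBE` for an undetermined status, and the convention that the verdict is the first line of the tool output are all hypothetical.

```python
import random


def select_problems(problem_ids, k=100, seed=None):
    """Randomly pick k distinct problems from a database (assumed to be a list of ids)."""
    rng = random.Random(seed)
    return rng.sample(list(problem_ids), k)


def classify_answer(output, elapsed_s, limit_s=60.0):
    """Map raw tool output to YES, NO, or MAYBE (status not determined).

    Assumption: the verdict is the first output line; everything after it
    is the human-readable justification and is not inspected here.
    """
    if elapsed_s > limit_s:
        return "MAYBE"  # an answer after the time limit does not count
    lines = output.splitlines()
    first = lines[0].strip() if lines else ""
    return first if first in ("YES", "NO") else "MAYBE"
```

Any output other than YES or NO, including a late answer, is treated uniformly as "status not determined", which mirrors the rule stated in the text.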

CRV: Competition on Runtime Verification
Organizers: Ezio Bartocci (TU Wien, Austria), Yliès Falcone (Univ. Grenoble Alpes/CNRS/INRIA, France), and Giles Reger (Univ. of Manchester, UK) Webpage: https://www.rv-competition.org/ Runtime verification (RV) is a class of lightweight scalable techniques for the analysis of system executions. We consider here specification-based analysis, where executions are checked against a property expressed in a formal specification language.
The core idea of RV is to instrument a software/hardware system so that it can emit events during its execution. These events are then processed by a monitor that is automatically generated from the specification. During the last decade, many important tools and techniques have been developed. The growing number of RV tools developed in the last decade and the lack of standard benchmark suites as well as scientific evaluation methods to validate and test new techniques have motivated the creation of a venue dedicated to comparing and evaluating RV tools in the form of a competition.
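To make the notion of a monitor concrete, the following Python sketch processes a stream of events and reports whether a property holds on the observed execution. It is a hand-written illustration: in RV the monitor would be generated automatically from a formal specification. The property ("a resource is never used after it has been closed"), the event format, and all names are assumptions made for this example and are not taken from any CRV benchmark.

```python
from typing import Iterable, Optional, Set, Tuple

Event = Tuple[str, str]  # (action, resource id), e.g. ("open", "f1")


class UseAfterCloseMonitor:
    """Monitor for the hypothetical property: no 'use' of a resource after its 'close'."""

    def __init__(self) -> None:
        self.closed: Set[str] = set()
        self.violation: Optional[Event] = None  # first violating event, if any

    def step(self, event: Event) -> bool:
        """Consume one event; return False once the property has been violated."""
        action, resource = event
        if self.violation is None:
            if action == "close":
                self.closed.add(resource)
            elif action == "open":
                self.closed.discard(resource)  # reopening makes the resource usable again
            elif action == "use" and resource in self.closed:
                self.violation = event
        return self.violation is None


def check_trace(trace: Iterable[Event]) -> bool:
    """Offline monitoring: feed a recorded trace to a fresh monitor."""
    monitor = UseAfterCloseMonitor()
    return all(monitor.step(event) for event in trace)
```

The same `step` method serves both settings mentioned in the competition tracks: called on a recorded log it performs offline monitoring, and called from instrumentation hooks in a running program it performs online monitoring.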
The Competition on Runtime Verification (CRV) is an annual event, held since 2014, and organized as a satellite event of the main RV conference. The competition is in general organized in different tracks: (1) offline monitoring, (2) online monitoring of C programs, and (3) online monitoring of Java programs. Over the first three years of the competition, 14 different runtime verification tools competed on over 100 different benchmarks.
In 2017 the competition was replaced by a workshop aimed at reflecting on the experiences of the last three years and discussing future directions. A suggestion of the workshop was to hold a benchmark challenge focussing on collecting new relevant benchmarks. Therefore, in 2018 a benchmark challenge was held with a track for Metric Temporal Logic (MTL) properties and an Open track. In 2019 CRV will return to a competition comparing tools, using the benchmarks from the 2018 challenge.
For each examination and each model instance, participating tools are provided with up to 3600 s of runtime and 16 GB of memory. Tool answers are analyzed and compared with the results produced by other competing tools to detect diverging answers (which are quite rare at this stage of the competition, and lead to penalties).
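The cross-checking of tool answers can be illustrated by a small Python sketch. The text does not say how the reference answer is determined, so the strict-majority vote used here is purely an assumption, as are the function name `diverging_tools` and the tool names in the usage example.

```python
from collections import Counter


def diverging_tools(answers):
    """Compare the answers of several tools on one examination.

    answers: dict mapping tool name -> answer, with None meaning "no answer".
    Returns (reference, diverging): the strict-majority answer and the sorted
    tools that disagree with it. Without a strict majority the reference is
    None and every tool that answered is reported for manual inspection.
    """
    given = {tool: a for tool, a in answers.items() if a is not None}
    if not given:
        return None, []
    top_answer, votes = Counter(given.values()).most_common(1)[0]
    if 2 * votes <= len(given):
        return None, sorted(given)
    return top_answer, sorted(t for t, a in given.items() if a != top_answer)
```

For example, `diverging_tools({"toolA": True, "toolB": True, "toolC": False, "toolD": None})` yields `(True, ["toolC"])`: the tool without an answer is ignored, and only the dissenting tool is flagged.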
For each examination, gold, silver, and bronze medals are awarded to the three best tools. CPU usage and memory consumption are reported, which is also valuable information for tool developers. Finally, numerous charts comparing the performance of pairs of tools, as well as quantile plots showing overall performance, are computed. The performance of tools on models (useful when they contain scaling parameters) is also provided.
The RERS benchmarks are automatically synthesized to exhibit chosen properties, and then enhanced to include dedicated dimensions of difficulty, ranging from conceptual complexity of the properties (e.g., reachability, full safety, liveness), over size of the reactive systems (a few hundred lines to millions of them), to exploited language features (arrays, arithmetic at index pointer, and parallelism). The general approach has been described in [89,90], while variants to introduce highly parallel benchmarks are discussed in [87,88,91]. RERS benchmarks have also been used by other competitions, like MCC or SV-COMP, and referenced in a number of research papers as a means of evaluation not only in the context of RERS [31,62,75,77,80,83].

In contrast to the other competitions described in this paper, RERS is problem-oriented and does not evaluate the power of specific tools but rather tool usage that ideally makes use of a number of tools and methods. The goal of RERS is to help reveal synergy potential also between seemingly quite separate technologies like, e.g., source-code-based (white-box) approaches and purely observation/testing-based (black-box) approaches. This goal is also reflected in the awarding scheme: besides the automatically evaluated questionnaires for achievements and rankings, RERS also features the Methods Combination Award for approaches that explicitly exploit cross-tool/method synergies.

Rodeo for Production Software Verification Tools Based on Formal Methods
Organizer: Paul E. Black (NIST, USA) Webpage: https://samate.nist.gov/FMSwVRodeo/ Formal methods are not widely used in the United States. The US government is now more interested because of the wide variety of FM-based tools that can handle production-sized software and because algorithms are orders of magnitude faster. NIST proposes to select production software for a test suite and to hold a periodic Rodeo to assess the effectiveness of tools based on formal methods that can verify large, complex software. To select software, we will develop tools to measure structural characteristics, like depth of recursion or number of states, and calibrate them on others' benchmarks. We can then scan thousands of applications to select software for the Rodeo.
Conference (FLoC). The competition consisted of four tracks, including a main track, a "no-limits" track with very few requirements for participation, and special tracks focusing on random SAT and parallel solving. In addition to the actual solvers, each participant was required to also submit a collection of previously unseen benchmark instances, which allowed the competition to only use new benchmarks for evaluation. Where applicable, verifiable certificates were required both for the "satisfiable" and "unsatisfiable" answers; the general time limit was 5000 s per benchmark instance and the solvers were ranked using the PAR-2 scheme, which encourages solving many benchmarks but also rewards solving the benchmarks fast. A detailed overview of the competition, including a summary of the results, will appear in the JSAT special issue on SAT 2018 Competitions and Evaluations.
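The PAR-2 scheme mentioned above (penalized average runtime with factor 2) can be sketched as follows: a solved instance contributes its runtime, and an unsolved instance contributes twice the time limit. The function names, the use of `None` for an unsolved instance, and the solver names in the usage note are assumptions made for this illustration; the competition's own scripts may differ in detail.

```python
def par2_score(runtimes, time_limit=5000.0):
    """Penalized average runtime with factor 2 (lower is better).

    runtimes: one entry per benchmark instance, either the runtime in
    seconds or None if the instance was not solved.
    """
    if not runtimes:
        raise ValueError("need at least one benchmark instance")
    total = 0.0
    for t in runtimes:
        if t is None or t > time_limit:
            total += 2 * time_limit  # unsolved: penalty of twice the limit
        else:
            total += t
    return total / len(runtimes)


def rank_solvers(results, time_limit=5000.0):
    """results: dict solver name -> list of runtimes; returns names, best first."""
    return sorted(results, key=lambda name: par2_score(results[name], time_limit))
```

With the 5000 s limit, a solver that solves one of two instances in 10 s and fails on the other scores (10 + 10000) / 2 = 5005, while a solver that needs 4000 s on each scores 4000 and is ranked first. This shows how the penalty favors solving more instances over solving a few quickly.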

SL-COMP: Competition of Solvers for Separation Logic
Organizer: Mihaela Sighireanu (Univ. of Paris Diderot, France) Webpage: https://sl-comp.github.io/ SL-COMP aims at bringing together researchers interested in improving the state of the art of automated deduction methods for Separation Logic (SL). The event has taken place twice so far and collected more than 1,000 problems for different fragments of SL. The input format of problems is based on the SMT-LIB format and therefore fully typed; only one new command is added to SMT-LIB's list, the command for the declaration of the heap's type. The SMT-LIB theory of SL comes with ten logics, some of them being combinations of SL with linear arithmetic. The competition's divisions are defined by the logic fragment, the kind of decision problem (satisfiability or entailment), and the presence of quantifiers. Until now, SL-COMP has been run on the StarExec platform, where the benchmark set and the binaries of participant solvers are freely available. The benchmark set is also available with the competition's documentation in a public repository on GitHub.

SMT-COMP
Organizers: Matthias Heizmann (Univ. of Freiburg, Germany), Aina Niemetz (Stanford Univ., USA), Giles Reger (Univ. of Manchester, UK), and Tjark Weber (Uppsala Univ., Sweden) Webpage: http://www.smtcomp.org Satisfiability Modulo Theories (SMT) is a generalization of the satisfiability decision problem for propositional logic. In place of Boolean variables, SMT formulas may contain terms that are built from function and predicate symbols drawn from a number of background theories, such as arrays, integer and real arithmetic, or bit-vectors. With its rich input language, SMT has applications in software engineering, optimization, and many other areas.
The International Satisfiability Modulo Theories Competition (SMT-COMP) is an annual competition between SMT solvers. It was instituted in 2005, and is affiliated with the International Workshop on Satisfiability Modulo Theories. Solvers are submitted to the competition by their developers, and compete against each other in a number of tracks and divisions. The main goals of the competition are to promote the community-designed SMT-LIB format, to spark further advances in SMT, and to provide a useful yardstick of performance for users and developers of SMT solvers.

SV-COMP: Competition on Software Verification
Organizer: Dirk Beyer (LMU Munich, Germany) Webpage: https://sv-comp.sosy-lab.org/ The 2019 International Competition on Software Verification (SV-COMP) is the 8th edition in a series of annual comparative evaluations of fully-automatic tools for software verification. The competition was established and first executed in 2011 and the first results were presented and published at TACAS 2012 [17]. The most important goals of the competition are the following:
1. Provide an overview of the state of the art in software-verification technology and increase visibility of the most recent software verifiers.
2. Establish a repository of software-verification tasks that is publicly available for free as standard benchmark suite for evaluating verification software.
3. Establish standards that make it possible to compare different verification tools, including a property language and formats for the results, especially witnesses.
4. Accelerate the transfer of new verification technology to industrial practice.
The benchmark suite for SV-COMP 2019 [23] consists of nine categories with a total of 10 522 verification tasks in C and 368 verification tasks in Java. A verification task (benchmark instance) in SV-COMP is a pair of a program M and a property φ, and the task for the solver (here: verifier) is to verify the statement M |= φ, that is, the benchmarked verifier should return false and a violation witness that describes a property violation [26,30], or true and a correctness witness that contains invariants to re-establish the correctness proof [25]. The ranking is computed according to a scoring schema that assigns a positive score (1 and 2) to correct results and a negative score (−16 and −32) to incorrect results, for tasks with and without property violations, respectively. The sum of CPU time of the successfully solved verification tasks is the tie-breaker if two verifiers have the same score. The results are also illustrated using quantile plots.
The 2019 competition attracted 31 participating teams from 14 countries. This competition included Java verification for the first time, and this track had four participating verifiers. As before, the large jury (one representative of each participating team) and the organizer made sure that the competition follows high quality standards and is driven by the four important principles of (1) fairness, (2) community support, (3) transparency, and (4) technical accuracy.
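The scoring schema and tie-breaker can be sketched in Python as below. This is a simplified illustration, not the official SV-COMP scoring: it ignores witnesses entirely, and it assumes that the −16 penalty applies to a wrong "false" answer (a false alarm) and the −32 penalty to a wrong "true" answer (a missed violation), since the text gives only the four values. The names `Result`, `score`, and `rank` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Points keyed by (verifier's answer, whether that answer is correct).
# Assumption: -16 for a wrong "false", -32 for a wrong "true".
POINTS = {
    (False, True): 1,    # correct "false": violation found
    (True, True): 2,     # correct "true": property proved
    (False, False): -16,
    (True, False): -32,
}


@dataclass
class Result:
    answer: Optional[bool]  # True, False, or None for unknown/timeout
    expected: bool          # True iff the program satisfies the property
    cpu_time_s: float


def score(results):
    """Return (total score, CPU time summed over correctly solved tasks)."""
    total, cpu = 0, 0.0
    for r in results:
        if r.answer is None:
            continue  # no answer: zero points
        correct = r.answer == r.expected
        total += POINTS[(r.answer, correct)]
        if correct:
            cpu += r.cpu_time_s
    return total, cpu


def rank(verifiers):
    """verifiers: dict name -> list of Result; best first, CPU time breaks ties."""
    def key(name):
        total, cpu = score(verifiers[name])
        return (-total, cpu)
    return sorted(verifiers, key=key)
```

The asymmetry between small rewards and large penalties means that a single wrong answer outweighs many correct ones, so a verifier is better off answering "unknown" than guessing.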

termComp: The Termination and Complexity Competition
Organizer: Akihisa Yamada (National Institute of Informatics, Japan) Steering Committee: Jürgen Giesl (RWTH Aachen Univ., Germany), Albert Rubio (Univ. Politècnica de Catalunya, Spain), Christian Sternagel (Univ. of Innsbruck, Austria), Johannes Waldmann (HTWK Leipzig, Germany), and Akihisa Yamada (National Institute of Informatics, Japan) Webpage: http://termination-portal.org/wiki/Termination_Competition The Termination and Complexity Competition (termComp) focuses on automated termination and complexity analysis for various kinds of programming paradigms, including categories for term rewriting, integer transition systems, imperative programming, logic programming, and functional programming. It has been organized annually following a tool demonstration in 2003. In all categories, the competition also welcomes the participation of tools providing certifiable output. The goal of the competition is to demonstrate the power and advances of the state-of-the-art tools in each of these areas.

Test-Comp: Competition on Software Testing
Organizer: Dirk Beyer (LMU Munich, Germany) Webpage: https://test-comp.sosy-lab.org/ The 2019 International Competition on Software Testing (Test-Comp) [24] is the 1st edition of a series of annual comparative evaluations of fully-automatic tools for software testing. The design of Test-Comp is very similar to the design of SV-COMP, with the major difference that the task for the solver (here: tester) is to generate a test suite, which is validated against a coverage property, that is, the ranking is based on the coverage that the resulting test suites achieve.
There are several new and powerful tools for automatic software testing around, but they were difficult to compare before the competition [28]. The reason was that no established benchmark suite of test tasks was available and many concepts were only validated in research prototypes. Now the test-case generators support a standardized input format (for C programs as well as for coverage properties). The overall goals of the competition are:
• Provide a snapshot of the state of the art in software testing to the community. This means to compare, independently from particular paper projects and specific techniques, different test-generation tools in terms of precision and performance.
• Increase the visibility and credits that tool developers receive. This means to provide a forum for presentation of tools and discussion of the latest technologies, and to give the students the opportunity to publish about the development work that they have done.
• Establish a set of benchmarks for software testing in the community. This means to create and maintain a set of programs together with coverage criteria, and to make those publicly available for researchers to be used free of charge in performance comparisons when evaluating a new technique.

VerifyThis
The aims of the VerifyThis competition are:
• to bring together those interested in formal verification,
• to provide an engaging, hands-on, and fun opportunity for discussion, and
• to evaluate the usability of logic-based program verification tools in a controlled experiment that could be easily repeated by others.
The competition offers a number of challenges presented in natural language and pseudo code. Participants have to formalize the requirements, implement a solution, and formally verify the implementation for adherence to the specification.
There are no restrictions on the programming language and verification technology used. The correctness properties posed in problems will have the input-output behaviour of programs in focus. Solutions will be judged for correctness, completeness, and elegance.

On the Future of Competitions
In this paper, we have provided an overview of the wide spectrum of different competitions and challenges. Each competition can be distinguished by its specific problem profile, characterized by analysis goals, resource and infrastructural constraints, application areas, and dedicated methodologies. Despite their differences, these competitions and challenges also have many similar concerns, related to, e.g., (1) benchmark selection, maintenance, and archiving, (2) evaluation and rating strategies, (3) publication and replicability of results, as well as (4) licensing issues.
TOOLympics aims at leveraging the potential synergy by supporting a dialogue between competition organizers about all relevant issues. Besides increasing the mutual awareness about shared concerns, this also comprises:
• the potential exchange of benchmarks (ideally supported by dedicated interchange formats), e.g., from high-level competitions like VerifyThis, SV-COMP, and RERS to more low-level competitions like SMT-COMP, CASC, or the SAT competition,
• the detection of new competition formats or the aggregation of existing competition formats to establish a better coverage of verification problem areas in a complementary fashion, and
• the exchange of ideas to motivate new participants, e.g., by lowering the entrance hurdle.
There have been a number of related initiatives with the goal of increasing awareness for the scientific method of evaluating tools in a competition-based fashion, like the COMPARE workshop on Comparative Empirical Evaluation of Reasoning Systems [63], the Dagstuhl seminar on Evaluating Software Verification Systems in 2014 [27], the FLoC Olympic Games 2014 and 2018, and the recent Lorentz Workshop on Advancing Verification Competitions as a Scientific Method. TOOLympics aims at joining forces with all these initiatives in order to establish a comprehensive hub where tool developers, users, participants, and organizers may meet and discuss current issues, share experiences, compose benchmark libraries (ideally classified in a way that supports cross-competition usage), and develop ideas for future directions of competitions.
Finally, it is important to note that competitions have resulted in significant progress in their respective research areas. Typically, new techniques and theories have been developed, and tools have become much stronger and more mature. This sometimes means that a disruption in the way that the competitions are handled is needed, in order to adapt the competition to these evolutions. It is our hope that platforms such as TOOLympics facilitate and improve this process.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

QComp: The Comparison of Tools for the Analysis of Quantitative Formal Models
Organizers: Arnd Hartmanns (Univ. of Twente, Netherlands) and Tim Quatmann (RWTH Aachen Univ., Germany) Webpage: http://qcomp.org Quantitative formal models capture probabilistic behaviour, real-time aspects, or general continuous dynamics. A number of tools support their automatic analysis with respect to dependability or performance properties. QComp 2019 is the first competition among such tools. It focuses on stochastic formalisms from Markov chains to probabilistic timed automata specified in the JANI model exchange format, and on probabilistic reachability, expected-reward, and steady-state properties. QComp draws its benchmarks from the new Quantitative Verification Benchmark Set. Participating tools, which include probabilistic model checkers and planners as well as simulation-based tools, are evaluated in terms of performance, versatility, and usability.

REC: The Rewrite Engines Competition
Organizers: Francisco Durán (Univ. of Malaga, Spain) and Hubert Garavel (Univ. Grenoble Alpes/INRIA/CNRS, Grenoble INP/LIG, France) Webpage: http://rec.gforge.inria.fr/ Term rewriting is a simple, yet expressive model of computation, which finds direct applications in specification and programming languages (many of which embody rewrite rules, pattern matching, and abstract data types), but also indirect applications, e.g., to express the semantics of data types or concurrent processes, to specify program transformations, or to perform computer-aided verification. The Rewrite Engines Competition (REC) was created under the aegis of the Workshop on Rewriting Logic and its Applications (WRLA) to serve three

RERS: Rigorous Examination of Reactive Systems
Organizers: Falk Howar (TU Dortmund, Germany), Markus Schordan (LLNL, USA), Bernhard Steffen (TU Dortmund, Germany), and Jaco van de Pol (Univ. of Aarhus, Denmark) Webpage: http://rers-challenge.org/ Reactive systems appear everywhere, e.g., as Web services, decision support systems, or logical controllers. Their validation techniques are as diverse as their appearance and structure. They comprise various forms of static analysis, model checking, symbolic execution, and (model-based) testing, often tailored to quite extreme frame conditions. Thus it is almost impossible to compare these techniques, let alone to establish clear application profiles as a means for recommendation. Since 2010, the RERS Challenge has aimed at overcoming this situation by providing a forum for experimental profile evaluation based on specifically designed benchmark suites.
SAT Competition 2018 is the twelfth edition of the SAT Competition series, continuing the almost two decades of tradition in SAT competitions and related competitive events for Boolean Satisfiability (SAT) solvers. It was organized as part of the 2018 FLoC Olympic Games in conjunction with the 21st International Conference on Theory and Applications of Satisfiability Testing (SAT 2018), which took place in Oxford, UK, as part of the 2018 Federated Logic