The RERS challenge: towards controllable and scalable benchmark synthesis

This paper (1) summarizes the history of the RERS challenge for the analysis and verification of reactive systems, its profile and intentions, its relation to other competitions, and, in particular, its evolution due to the feedback of participants, and (2) presents the most recent development concerning the synthesis of hard benchmark problems. In particular, the second part proposes a way to tailor benchmarks according to the depths to which programs have to be investigated in order to find all errors. This gives benchmark designers a method to challenge contributors that try to perform well by excessive guessing.


Introduction
Competitions and challenges have provided a valuable contribution to the development of verification and analysis tools, and numerous events of this kind have evolved over the last decades [4,6,8,29,32,36,43]. The approaches followed by these events, many of which recur regularly, vary from off-site to on-site, with or without concrete resource constraints, from solution orientation to tool orientation, from known benchmark problems to problems with unknown true properties to controlled, generated benchmarks, and from qualitative/human evaluation to automated evaluation processes (cf. Sect. 2 and [5]).
The RERS challenge is characterized by its property-oriented benchmark generation: benchmarks are automatically generated in a "requirements-driven" fashion. More precisely, the starting point of the benchmark generation process is a set of desired structural properties, here formulated in LTL, which are successively transformed via Büchi automata that characterize all satisfying executions to Modal Transition Systems and, with a few more steps, transformed to code of various implementation languages (cf. Sect. 2.4). This construction principle aims at benchmarks that closely resemble realistic code, but can be flexibly tailored in their degree of difficulty. Originally, we considered size, amount of arithmetic operations, and the data structures used as a measure for intricacy. Over the years, the importance of controlling the length of shortest counterexamples as a means for scaling the difficulty of the verification task (in contrast to the complexity of the benchmark problem) became more and more apparent.
This paper consists of two parts: The first part (Sect. 2) summarizes the history of RERS, its profile and intentions, its relation to other competitions, and, in particular, its evolution due to the feedback of participants. This also comprises a discussion of experienced 'oddities', both at RERS and in relation to other competitions, as well as ways to overcome them. The second part (Sect. 3) presents our most recent development concerning the control of counterexample lengths. The proposed tailoring of benchmarks, focusing on the depths to which programs have to be investigated in order to find all errors, gives benchmark designers a way to challenge contributors who are claiming satisfaction without sufficient evidence.

The RERS challenge
The Rigorous Examination of Reactive Systems challenge (RERS) is a verification challenge that focuses on temporal and reachability properties of reactive systems. RERS was founded in 2010 and has had annual instances since 2012. The challenge was designed to explore, evaluate, and compare the capabilities of state-of-the-art software verification tools and techniques. Areas of interest include but are not limited to static analysis [52], model checking [9,14,25], symbolic execution [38], and (learning-based) testing [62].
The key idea of RERS is to use generated, realistic problems of scalable complexity on which participants have to check sets of properties.
Automatic generation of benchmarks with known properties provides new problems each year that are (a) previously unknown to participants, and (b) for which the correct verdicts for properties are not known to the participants during the challenge, preventing "performance tuning" of participating verification tools towards a high score on the basis of known expected results or known characteristics of benchmarks. Realism of benchmarks (in contrast to typical randomly generated benchmarks) is achieved in a requirements-driven fashion: programs are generated according to characteristic temporal patterns resembling the structure of real code. Scalability of difficulty is the basis for detailed performance profiling of participating tools.
In this section, we provide a brief motivation for RERS, an overview of the history of RERS, sketch some of the scientific contributions on the automatic generation of benchmarks that were facilitated through RERS, and briefly describe how different ranking methods in RERS enable detailed profiling of tools.
Remark. Parts of this section have been published before in papers or on the RERS website. We provide pointers to more detailed accounts where it is appropriate and possible. The focus of this section is on providing a general overview.

Rigorous examination of reactive systems
The motivation of RERS is to enable profiling of principal capabilities and limitations of tools and approaches. The RERS challenge is therefore "free-style", i.e., with neither time nor resource limitations, and encourages the combination of methods and tools. Strict time or resource limitations in combination with previously known solutions encourage tools to be tweaked for certain training sets, which could give a false impression of their capabilities. They also lead to abandoning time-consuming problems in the interest of time. Our focus on principal capabilities instead of defined and identical resources is reflected by making RERS a challenge instead of a competition. We only provide the tasks and collect the results from participants. Solutions are computed by them in any way they want. The main goals of RERS are:
1. encourage the combination of methods from different (and usually disconnected) research fields for better software verification results,
2. provide a complete framework for an automated challenge organization that covers the process from generating differently tailored tasks that reveal the strengths and weaknesses of specific approaches to an automated result comparison (excluding the computation of results itself),
3. initiate a discussion about better benchmark generation, reaching out across the usual community barriers to provide benchmarks useful for testing and comparing a wide variety of tools, and
4. collect (additional) interesting syntactical features that should be supported by benchmark generators.
There exists no other software verification challenge with a profile that is similar to that of RERS: while (1) is a quite generic goal that is pursued by a number of verification competitions, goals (2)-(4) are unique to RERS. Nevertheless, RERS shares some intentions and characteristics with SV-COMP, MCC, and VerifyThis. The software verification competition [8] (SV-COMP) is also concerned with reachability properties and features a few verification tasks concerning termination and memory safety. In direct comparison, SV-COMP does not allow the manual combination of tools and directly addresses tool developers. In contrast to RERS, it has time and resource limitations, does not feature certificate-like achievements (cf. Sect. 2.5), but has developed a detailed ranking system for the comparison of tools and tries to prevent guessing by imposing high penalties on mistakes. An important difference to SV-COMP is that RERS features benchmarks that are generated automatically for each challenge iteration, ensuring that all results to the verification tasks are unknown to participants. Over time, the RERS benchmark generator contributed problems to the SV-COMP benchmark repository.
Another competition concerned with the verification of parallel systems in combination with LTL properties is the Model Checking Contest [43] (MCC). Participants have to analyze Petri nets as abstract models and check LTL and CTL formulas, the size of the state space, reachability, and various upper bounds. The benchmark consists of a large set of known models and a small set of unknown models that were collected among the participants.
In contrast to RERS, MCC participants submit tools that have to adhere to resource restrictions, rather than problem answers. Moreover, the correct answers to the used verification tasks are not always known, and a majority vote-based approach to correctness is used. This may well penalize outstanding approaches that are, e.g., unique in identifying the correct result. This problem is overcome for RERS due to its property-oriented benchmark generation. We were happy to hear that MCC started using some verification tasks of RERS to partially overcome this problem.
Finally, VerifyThis [29] features program verification challenges. Participants get a fixed amount of time to work on a number of challenges, to prove the functional correctness of a number of non-trivial algorithms. That competition focuses on the use of interactive or semi-interactive tools. Similar to RERS, VerifyThis encourages the use of a mixture of tools; however, submissions are judged by a jury. In direct comparison, RERS participants submit results that can be checked and ranked automatically; only the "best approach award" involves a jury judgment.

Genesis (from ZULU to RERS)
The idea for RERS arose in 2010 after participating in the ZULU automata learning competition [28]. The ZULU competition had some very exciting and some rather frustrating aspects. The competition was based on randomly generated automata, the participating learning tools competed in a black-box scenario, and questionnaires (sets of words for which participants had to decide language membership) were the basis for ranking tools.
On the one hand, ZULU had an incredibly engaging training and competition mode: contestants could generate new training problems in a push-button approach and ranking of tools on all benchmark instances was instantly visible. Improvements to algorithms did translate to almost instant gratification, fueling a month-long race for the win.
The mode of ranking performance by counting correct answers in questionnaires, on the other hand, did not serve well for differentiating tools and in some cases even favored learning algorithms that were already known to perform badly on real problems. Less precise models produced better predictions for certain distributions of words in questionnaires. Moreover, algorithms could be (and were) tuned towards the structural properties of a randomly generated benchmark. It turned out that this tuning was often counterproductive for inferring models of real systems.
The RERS initiative aimed at developing an engaging challenge, or a set of challenges (cf. Sect. 2.3), in the area of formal methods that would overcome the perceived weaknesses of ZULU. As a consequence, RERS is based on generated benchmarks (cf. Sect. 2.4), and one of the long-term goals of RERS is making the generation of new benchmarks accessible to participants. At the same time, the approach to benchmark generation in RERS aims at generating benchmarks that have realistic properties, resulting in relevant performance profiles of tools. This aim is also supported by RERS providing multiple modes of ranking and rating, tailored to profile contributions according to their capabilities and limitations (cf. Sect. 2.5).

Tracks and history
After an initial workshop in 2010, RERS has had yearly challenges since 2012 with a constantly evolving set of tracks and verification challenges. Since 2012, a total of 49 people from 16 different research groups participated in RERS. Table 1 provides a comprehensive overview that will be detailed by the remainder of this section.

Sequential Programs
RERS started in 2012 with sequential benchmark programs in two tracks (LTL and Reachability) that correspond to the property type that has to be analyzed. Sequential benchmark programs are made available as Java and C programs. Since 2014, there are three categories in each track that represent the syntactical features included in the benchmark programs that belong to the respective category.
- Plain. The programs only contain assignments, with the exception of some scattered summation, subtraction, and remainder operations in the reachability problems.
- Arithmetic. The programs frequently contain summation, subtraction, multiplication, division, and remainder operations.
- Data structures. Arrays and operations on arrays are added (other data structures are planned for the future).
In each category, small, medium-sized, and large programs are generated for a challenge benchmark.
Starting in 2020, LTL properties will be controlled for minimal depth of counterexamples (presented in this paper), enabling an additional dimension in which complexity can be scaled.

Parallel Programs
Since 2016, RERS features benchmarks that contain parallel systems which are made available as labeled transition systems (LTSs), Promela [24] code, and Petri nets [20,53]. The parallel track started with LTL properties and was tentatively extended to CTL properties in 2018. In 2019, CTL properties became a full track for the category of parallel programs (e.g. Petri nets) and were therefore on par with our support for LTL model checking tasks.

Experimental Tracks
In several years, RERS had experimental tracks that did not (yet) result in permanent additions to the challenge.
- In 2013, RERS featured grey-box and black-box problems in addition to the (default) white-box problems. The additional problems were intended to encourage participation of black-box approaches and facilitated integration of white-box and black-box techniques.
- In 2015, RERS was co-located with the international conference on runtime verification (RV) [6] and featured monitoring problems for which traces were provided.
- In 2019, for the first time in the history of RERS, the challenge featured benchmark programs that are based on real-world models [32]. The corresponding challenge tracks were based on a cooperation with ASML, a large Dutch semiconductor company that provided the underlying models. Properties that participants could analyze for these systems ranged from reachability queries over LTL formulas to CTL properties (omitted in Table 1).
A detailed history and description of all past tracks and all sets of challenge problems can be found on the RERS website along with properties and expected verdicts.

Synthesis of benchmarks with known properties
RERS relies on generated benchmark problems of scalable complexity and with known properties. The motivation for this, as stated above, is to enable detailed profiling of tools.
The RERS benchmark generation technology combines scalable complexity with known properties, two goals that appear conflicting at first glance: it is impossible to automatically decide properties on problems that are too complex for current tools to analyze. Other competitions (e.g. MCC) solve this by determining verdicts that ought to be accepted as correct by majority vote. This, of course, has the drawback that a high performance of few tools, resulting in uncommon but correct verdicts on some problems, can lead to a competition ranking that does not reflect actual performance. We observed this firsthand in the ZULU competition. Motivated by this experience, we have developed a generic method and tool-boxes for generating benchmark problems of scalable complexity with known properties. A frequent argument against the use of generated benchmarks is the potential threat to the validity of profiling results that arises from their artificial nature. We address this threat in RERS by using sets of LTL properties for inducing structure or actual industrial system models at the core of our benchmark synthesis.
In this section, we provide a brief overview of the generic method, using the generation of sequential benchmark problems as a concrete example. Detailed accounts of concrete tool-boxes for different classes of benchmark problems can be found in the papers listed in Table 2.

Table 2
- Property-driven benchmark generation [63]
- 2014: Tailored generation of concurrent benchmarks [61]
- 2014: Property-driven benchmark generation: synthesizing programs of realistic structure [64]
- 2017: Property-preserving generation of tailored benchmark Petri nets [67]
- 2018: Synthesizing subtle bugs with known witnesses [35]

General Approach
Our general approach to the generation of benchmarks that we use in RERS is sketched in Fig. 1 and exists in two variants, property-based benchmark generation and model-based benchmark generation. Both variants follow the same high-level pattern. The process is divided into two phases. In the first phase (upper half of both sub-figures), benchmark properties are established on a small model. In the second phase, models are expanded by semantics-preserving transformations that increase complexity at the model-level and then generated into code. Code generation can add another dimension of complexity by encoding the behavior specified in the model through different language features (e.g., using arithmetic expressions or data structures).

Property-based Benchmark Generation
In this variant (left of Fig. 1), we start the generation process by randomly choosing and then instantiating LTL property specification patterns [19] that we partition into a small defining set used in the subsequent synthesis step, and a larger set of additional properties whose validity is later checked on the synthesized model via model checking. Typically, we generate around 100 properties, about ten of which can be defining, in order to still allow for automated synthesis. Our current implementation uses LTL2Buchi [23] and the Spot library [18] for translating the LTL specification into a Büchi automaton. The resulting intermediate Büchi automaton is then transformed into a concrete reactive system model (a Mealy machine) that represents all words/paths satisfying the defining properties. The construction of this Mealy machine is randomized and can be customized in various dimensions, e.g., the size of the model, the size of the input and output alphabets, the density of the transition graph, etc., while guaranteeing that all defining properties remain valid.

Model-based Benchmark Generation
In this variant, we start from a reactive system model. Such models were provided by ASML in 2019 [32]. Properties and verdicts can then be computed in two different ways from these models (right of Fig. 1). Generated properties can be model-checked as in the property-driven approach. This was, e.g., done for LTL properties in the industrial track of RERS 2019. Alternatively, properties can be computed from the models directly. This was done for CTL properties in the industrial track of RERS 2019.

Model-Expansion and Code Generation
In the second phase of generating sequential benchmark problems, Mealy machines are enlarged via randomized property-oriented expansion (POE) [60] and by introducing unreachable states. Both transformations are incremental and can be stopped at any moment, e.g., when a certain threshold of states is reached. The transformation from Mealy machines to programs interprets Mealy machines as simple loops of guarded commands, whose guards precisely check for the correct state identification, and replaces the simple guard structure with a complex, semantically equivalent decision structure.
As a final step, we employ data-flow analysis, transformation and code motion techniques [12,39-42,49,58,68] to randomly elaborate the program model structure along both the logical and the control structure, delocalizing information and obtaining quite general while-program-like structures [2].
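The reading of a Mealy machine as a loop of guarded commands can be illustrated with a minimal sketch. The toy machine, its state and symbol names, and the helpers `emit_program` and `load` are hypothetical and chosen for illustration only; the sketch covers just the plain guard structure and none of the subsequent obfuscating transformations.

```python
from typing import Dict, Tuple

# Hypothetical Mealy machine: (state, input) -> (next state, output).
Mealy = Dict[Tuple[int, str], Tuple[int, str]]

TOY_MACHINE: Mealy = {
    (0, "A"): (1, "X"),
    (0, "B"): (0, "Y"),
    (1, "A"): (0, "Z"),
}


def emit_program(machine: Mealy, initial: int = 0) -> str:
    """Render the machine as Python source with one guarded command per transition."""
    lines = [
        "def run(inputs):",
        f"    state = {initial}",
        "    outputs = []",
        "    for symbol in inputs:",
    ]
    keyword = "if"
    for (src, sym), (dst, out) in sorted(machine.items()):
        # The guard checks exactly for the current state and the input symbol.
        lines.append(f"        {keyword} state == {src} and symbol == {sym!r}:")
        lines.append(f"            state = {dst}")
        lines.append(f"            outputs.append({out!r})")
        keyword = "elif"
    lines.append("        else:")
    lines.append("            raise ValueError('invalid input')")
    lines.append("    return outputs")
    return "\n".join(lines)


def load(source: str):
    """Compile the generated source and return its `run` function."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["run"]


run = load(emit_program(TOY_MACHINE))
```

Calling `run(["A", "A", "B"])` replays the three transitions of the toy machine; an input that has no transition in the current state raises `ValueError`, which is one possible way to treat undefined inputs and not necessarily the one used in the generated benchmarks.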

Generation of Parallel Benchmarks
We have also applied our property-preserving generation process to obtain parallel systems in various formats like (Nested-Unit) Petri nets [20,53], Promela [24] code, or simply as graphs in DOT. This also happens in two conceptual steps: first, we synthesize an interesting core model from an LTL specification in the same way as for the sequential case, and then decompose this core model into parallel components in a property-preserving fashion. Key for this decomposition was a new notion of modal contracts [65] which allows us to generate parallel systems with arbitrarily many components.
Basing property preservation on modal refinement [46] instead of language inclusion guarantees that not only linear-time properties are preserved, but branching-time properties as well. This allows us to use the generated parallel models not only for the reachability and the LTL model checking track, but also for the CTL model checking and bisimulation checking track [66], the latter being planned as a future addition.

Ranking
RERS has a three-dimensional reward structure that consists of a competition-based ranking on the total number of points, achievements for solving problems without submitting wrong answers, and an evaluation-based award for the most original idea or a good combination of methods. Computation of scores and modes of ranking (per track, per category) have evolved slightly over the years. Adaptations were made to arrive at more detailed, relevant, and valid profiling of participating approaches.

Competition-based Ranking
The competition-based ranking was established to facilitate competition and as a direct comparative evaluation of the capabilities of tools. Participants are free to opt out of this ranking and to only aim at obtaining achievements. For the ranking, a score for the performance of every participating tool is computed. Based on these scores, tools are ranked. Positive points are awarded for correct verdicts. Incorrect verdicts lead to penalties whose size was a major point of discussion over the years, leading to frequent changes. The negative impact of incorrect verdicts on a tool's score in the competition-based ranking was originally quite small. In 2012 it was just −1 point, and it was −2 points in 2013 to 2015. In 2016 there was a drastic change in the penalty, which became exponential in the number $n$ of errors ($-2^n$). This change turned out to be too drastic, and we have therefore been using quadratic penalties ($-n^2$) since 2018.
For RERS 2019, the points for correct answers were also refined (from previously one point per correct answer) to two points for verifiable LTL properties or unreachable errors and one point for refutable LTL properties or reachable errors, accounting for the fact that showing the existence of counterexamples and errors is usually easier than proving their absence.

Achievements
To honor the accomplishments of verification tools and methods without the pressure of losing in a competition despite good results, and only in relation to the complexity of the set of benchmark problems, RERS introduced achievements for different nuances of difficulty.
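The evolution of the penalty can be summarized in a small sketch. The function name `penalty` is hypothetical, and several details are assumptions that the text does not settle: that the fixed penalties of 2012 to 2015 apply per incorrect verdict, that 2017 used the same exponential scheme as 2016, and that no points are deducted when there are no errors.

```python
def penalty(year: int, wrong: int) -> int:
    """Points deducted for `wrong` incorrect verdicts under the scheme of `year`."""
    if wrong < 0:
        raise ValueError("number of wrong verdicts must be non-negative")
    if wrong == 0:
        return 0  # assumption: no deduction without errors
    if year == 2012:
        return wrong  # -1 point, assumed to apply per incorrect verdict
    if 2013 <= year <= 2015:
        return 2 * wrong  # -2 points, assumed to apply per incorrect verdict
    if year in (2016, 2017):
        return 2 ** wrong  # exponential; 2017 is assumed to match 2016
    if year >= 2018:
        return wrong ** 2  # quadratic since 2018
    raise ValueError("no competition-based ranking before 2012")
```

For five wrong verdicts, the deduction under these assumptions is 5 points in 2012, 10 in 2014, 32 in 2016, and 25 in 2018, which illustrates why the exponential scheme was considered too drastic.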
For every category there are three achievements: bronze, silver and gold. An achievement is only awarded if no wrong answers are given in the respective category. For tracks on CTL properties, a participant needs to answer 12 out of 20 properties correctly in order to "solve" an individual problem. If there are $n$ problems within such a track, then a participant needs to answer $\frac{1}{3} \cdot n \cdot 12$ properties correctly for a bronze award, $\frac{2}{3} \cdot n \cdot 12$ for silver, and $n \cdot 12$ for gold. For the remaining tracks of RERS, proving the absence of a property violation is typically much harder than showing such a violation. Taking this into account, achievements are awarded for reaching a threshold of points that is equal to the number of counterexamples that can be witnessed for the corresponding group of benchmark instances, as long as no wrong answer is given. Counterexamples are paths reaching an error function for the Reachability track and paths violating LTL properties for the LTL tracks. Only the highest achievement for every category is actually awarded and the thresholds for every category are calculated as follows:

bronze = #falsifiable properties of small problem
silver = bronze + #falsifiable properties of medium problem
gold = silver + #falsifiable properties of large problem

The participant's achievement score within a category is computed from all submitted results (verified or falsified). Let $a_t(C) = n$ be the achievement score of tool $t$ for category $C$, where $n$ is the number of correct (i.e., reported) verdicts for category $C$. Now let, e.g.,

$\mathit{bronze}(C) \le a_t(C)$ and $a_t(C) < \mathit{silver}(C)$.

Then participant $t$ is awarded the Bronze Achievement in category $C$. It is possible to receive six achievements in the sequential tracks: one for each category (Plain, Arithmetic, Data Structures) in the Reachability and LTL track, respectively. In the parallel tracks, an overall of six achievements can be obtained by participating tools for small, medium, and large problems in the LTL and CTL tracks.
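The threshold rules above can be written down as a short sketch. The function names and the dictionary representation are hypothetical; the numbers 12 and the cumulative bronze/silver/gold sums are taken from the text.

```python
from typing import Dict, Optional


def sequential_thresholds(small: int, medium: int, large: int) -> Dict[str, int]:
    """Thresholds from the numbers of falsifiable properties per problem size."""
    bronze = small
    silver = bronze + medium
    gold = silver + large
    return {"bronze": bronze, "silver": silver, "gold": gold}


def ctl_thresholds(num_problems: int) -> Dict[str, int]:
    """CTL tracks: 12 correct answers count as solving one problem."""
    solved = 12 * num_problems  # always divisible by 3
    return {"bronze": solved // 3, "silver": 2 * solved // 3, "gold": solved}


def achievement(score: int, wrong_answers: int,
                thresholds: Dict[str, int]) -> Optional[str]:
    """Highest achievement reached, or None; any wrong answer disqualifies."""
    if wrong_answers > 0:
        return None
    best = None
    for level in ("bronze", "silver", "gold"):
        if score >= thresholds[level]:
            best = level
    return best
```

With, say, 10, 20, and 30 falsifiable properties in the small, medium, and large problem, a score of 29 without wrong answers yields bronze, because it lies between the bronze threshold 10 and the silver threshold 30.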
Since achievements are awarded on a per-participant basis, there may be multiple gold-medalists in some category in any particular year of RERS.

Evaluation-based Award
To honor creativity and cross-fertilization between different research areas, RERS features jury-based awards. For these awards, category winners are chosen based on the employed (combination of) methods, which need not have scored highest. Submitted descriptions of approaches and solutions are reviewed and ranked by the challenge organizers. Due to the possible variety of methods there may be several winners in this category.

Impact
In the ten years since its inception, RERS has had an impact in different dimensions.
Scientific Contributions. First of all, RERS has facilitated a number of scientific advances by challenge participants. Some examples are presented in [1,7,10,11,15-17,30,34,37,44,45,47,48,50,51,56,57,59,69-71].
Benchmark Generation. Organizing RERS required the generation of benchmarks. Over the past decade, we have developed multiple approaches for generating scalable and realistic benchmarks with known properties. Benchmark generation required integration of a diverse set of formal methods, and RERS benchmarks have been integrated by other verification competitions (e.g., SV-COMP) into their sets of benchmark programs.
Combination of Methods. Over the years, RERS has facilitated a number of promising combinations of methods, e.g. [44]. In the latest instance, participants of RERS 2019 notably used diverse combinations of tools to produce their answers to the given verification tasks. As an example for this diversity, one of the participating teams combined verification based on grey-box fuzzing and traditional compiler-based interval analysis. Another team employed three different available verification tools to generate their submission and thereby profiled and utilized the individual strengths and weaknesses of these tools.
In summary, one can argue that instead of submitting an executable tool that computes a single verdict automatically as commonly required in verification competitions such as SV-COMP or MCC, participants of RERS make use of the freedom from resource constraints by employing an entire toolkit to solve the given verification tasks. RERS allows manual comparison of the output of tools and gives room for final judgment made by humans on the verdicts of a bouquet of verifiers and approaches, whereas other competitions enforce completely automated decisions by tools. This plethora of approaches provides evidence that RERS achieves one of its main goals, namely to motivate the combination of different approaches and technologies (see Sect. 2.1).

Guaranteeing hardness of benchmarks
In this section, we sketch our most recent approach to tailor benchmark problems according to hardness: the generation of benchmark problems which are known to have no evidence for a counterexample that is shorter than a given threshold, but which are also guaranteed to have such evidence for an additionally provided upper bound.This allows the production of benchmarks with a designed distribution of depths to which the programs have to be investigated in order to find all errors.In particular, this gives benchmark designers a methodology for challenging contributors that are claiming satisfaction without having a proper proof.

Preliminaries
Fundamental to our approach are the notions related to words and languages:

Definition 1 (Words) Given a finite alphabet $\Sigma$, a word over $\Sigma$ is a (possibly empty or infinite) sequence of symbols from $\Sigma$. Given an integer $n \in \mathbb{N}$ and a finite word $w = \sigma_1 \sigma_2 \ldots \sigma_n$, $|w|$ denotes the length $n$ of $w$. Any infinite word $w$ has the length $|w| = \infty$. Given any word $w = \sigma_1 \sigma_2 \ldots$ and any integer $i \in \mathbb{N}$ such that $i \le |w|$, $w^{\le i}$ denotes the prefix of $w$ of length $i$.

Definition 2 (Languages) Given a finite alphabet $\Sigma$, a language (over $\Sigma$) is a set of words over $\Sigma$. For a given $n \in \mathbb{N}$, the language $\Sigma^n$ consists of all words $w = \sigma_1 \sigma_2 \ldots \sigma_n$ of length $|w| = n$ such that $\sigma_i \in \Sigma$ for all $i \in 1..n$.
For any $n \in \mathbb{N}$, we define $\Sigma^{\le n} := \bigcup_{i=1}^{n} \Sigma^i$, and additionally $\Sigma^* := \bigcup_{i \in \mathbb{N}} \Sigma^i$. A language $L$ is finite iff $|L| \in \mathbb{N}$ and infinite otherwise. $\Sigma^\omega$ denotes the infinite language that contains all infinite words over $\Sigma$. Moreover, $L$ is a language of finite words iff $L \subseteq \Sigma^*$, and a language of infinite words iff $L \subseteq \Sigma^\omega$. The concatenation of symbols extends naturally to languages: Given a language $L \subseteq \Sigma^*$ and any language $L'$, we have $L \cdot L' := \{ w \cdot w' \mid w \in L,\ w' \in L' \}$.

Our approach to benchmark generation (cf. Sect. 3.3) is based on the automatic generation of Büchi automata [13].
Definition 3 (Büchi Automaton) Let $B = (S, \Sigma, \Delta, s_0, F)$ be a finite automaton with a set $S$ of states and an alphabet $\Sigma$. State $s_0 \in S$ represents the initial state and $F \subseteq S$ a set of accepting states. The relation $\Delta \subseteq (S \times \Sigma \times S)$ represents transitions between states in $S$. We also write $p \xrightarrow{\sigma} q$ to denote $(p, \sigma, q) \in \Delta$.
A path $p$ in $B$ is a sequence of transitions $u_i \xrightarrow{\sigma_i} u_{i+1}$ with $i$ ranging from 1 to either a fixed integer $n$ or infinity. Path $p$ spells the word $w = \sigma_1 \sigma_2 \ldots$. Given these definitions, $B$ is called a Büchi automaton if it adheres to Büchi acceptance, meaning that it accepts infinite words $w \in \Sigma^\omega$ based on the following criteria:
1. There exists a path $p$ in $B$ that starts in $s_0$ and spells $w$.
2. This path $p$ visits a state in $F$ infinitely often.
The set $L(B) := \{ w \in \Sigma^\omega \mid B \text{ accepts } w \}$ defines the language of $B$.
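Büchi acceptance can be made tangible for ultimately periodic words of the form $u \cdot v^\omega$, which have a finite representation. The following sketch is an illustration only: the function names, the encoding of $\Delta$ as a set of triples, and the example automaton (which accepts words with infinitely many `a`) are assumptions and not part of the RERS tooling. It accepts iff some accepting state lies on a reachable cycle of the product of the automaton with the positions of the loop $v$.

```python
from typing import Sequence, Set, Tuple

Transition = Tuple[str, str, str]  # (source state, symbol, target state)


def post(delta: Set[Transition], states: Set[str], symbol: str) -> Set[str]:
    """States reachable from `states` by reading `symbol` once."""
    return {q for (p, a, q) in delta if p in states and a == symbol}


def accepts_lasso(delta: Set[Transition], initial: str, accepting: Set[str],
                  prefix: Sequence[str], loop: Sequence[str]) -> bool:
    """Decide whether the automaton accepts the word prefix . loop^omega."""
    if len(loop) == 0:
        raise ValueError("the loop part must be non-empty")

    # Read the finite prefix, tracking all states the automaton may be in.
    current = {initial}
    for symbol in prefix:
        current = post(delta, current, symbol)

    n = len(loop)

    def step(node):
        state, pos = node
        return {(q, (pos + 1) % n) for q in post(delta, {state}, loop[pos])}

    def reach(starts):
        """Product nodes reachable from `starts` in one or more steps."""
        seen = set()
        stack = [succ for node in starts for succ in step(node)]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(step(node))
        return seen

    entry = {(q, 0) for q in current}
    candidates = entry | reach(entry)
    # Accept iff a reachable accepting node can return to itself.
    return any(
        state in accepting and (state, pos) in reach({(state, pos)})
        for (state, pos) in candidates
    )


# Example automaton over {a, b}: q1 is entered exactly after reading an 'a'.
DELTA = {("q0", "a", "q1"), ("q0", "b", "q0"),
         ("q1", "a", "q1"), ("q1", "b", "q0")}
```

For instance, `accepts_lasso(DELTA, "q0", {"q1"}, "b", "ab")` holds because the loop keeps returning to `q1`, whereas the word given by prefix `"aaa"` and loop `"b"` is rejected because `q1` is visited only finitely often.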
The following definitions specify (propositional) linear temporal logic (LTL) [54], which we use to specify properties and as a basis for synthesizing Büchi automata. In essence, LTL is an extension of propositional logic that includes additional temporal operators. Its syntax is defined as follows [3]:

Definition 4 (Syntax of Linear Temporal Logic) Let $AP$ be a set of atomic propositions and $a \in AP$. The syntax of propositional linear temporal logic (LTL) is defined by the following grammar in Backus-Naur form:

$\varphi ::= \mathit{true} \mid a \mid \varphi_1 \wedge \varphi_2 \mid \neg\varphi \mid \mathbf{X}\,\varphi \mid \varphi_1\,\mathbf{U}\,\varphi_2$

The operator $\mathbf{X}$ (or "next") describes behavior that has to hold at the next time step. A formula $(\varphi_1\,\mathbf{U}\,\varphi_2)$ describes that $\varphi_2$ has to occur eventually and that $\varphi_1$ has to hold until $\varphi_2$ occurs in a sequence. The formal semantics of LTL is based on a satisfaction relation between infinite words and LTL formulas [3]:

Definition 5 (Semantics of LTL) Let $AP$ be an alphabet of atomic propositions and let $(2^{AP})^\omega$ denote infinite sequences over sets $A \subseteq AP$. For any sequence $w = (A_1, A_2, \ldots) \in (2^{AP})^\omega$ and any $i \in \mathbb{N}$, let $w_i = A_i$ be the $i$-th element of $w$ and $w^{\ge i} = (A_i, A_{i+1}, \ldots)$ be the suffix of $w$ starting at index $i$.
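For reference, the satisfaction relation $\models$ that this definition relies on can be spelled out as follows. This is the standard formulation found in [3], transcribed here into the 1-based indexing used above; it is given as an assumption about the intended clauses, and the presentation in the original definition may differ.

```latex
\begin{align*}
w &\models \mathit{true} \\
w &\models a
  &&\iff a \in w_1 \\
w &\models \neg\varphi
  &&\iff w \not\models \varphi \\
w &\models \varphi_1 \wedge \varphi_2
  &&\iff w \models \varphi_1 \text{ and } w \models \varphi_2 \\
w &\models \mathbf{X}\,\varphi
  &&\iff w^{\ge 2} \models \varphi \\
w &\models \varphi_1\,\mathbf{U}\,\varphi_2
  &&\iff \exists k \ge 1 :\ w^{\ge k} \models \varphi_2
      \text{ and } \forall j,\ 1 \le j < k :\ w^{\ge j} \models \varphi_1
\end{align*}
```

As a small worked instance, for $w = (\{a\}, \{a\}, \{b\}, \ldots)$ the formula $a\,\mathbf{U}\,b$ holds with witness $k = 3$, since $b \in w_3$ and $a \in w_1$, $a \in w_2$.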
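Given a language $L \subseteq \Sigma^\omega$, we define $L \models \varphi :\iff \forall w \in L.\ w \models \varphi$, and given a Büchi automaton $B$, we further define $B \models \varphi :\iff L(B) \models \varphi$. For any $\varphi \in \mathrm{LTL}$, the semantics $[\![\varphi]\!]$ of $\varphi$ is given by $[\![\varphi]\!] := \{ w \in (2^{AP})^\omega \mid w \models \varphi \}$.
Büchi automata are strictly more expressive than LTL [72]. One can synthesize a Büchi automaton $B$ from an LTL property $\varphi$ such that $L(B) = [\![\varphi]\!]$ holds [55].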
Using the basic set of operators in Definition 4, abbreviations for commonly described constraints can be introduced. Popular ones include $\mathbf{F}(\varphi) := (\mathit{true}\,\mathbf{U}\,\varphi)$, which expresses that $\varphi$ will eventually become true, and its dual operator $\mathbf{G}(\varphi) := \neg\mathbf{F}(\neg\varphi)$, which claims that $\varphi$ is always true. A later example also utilizes the weak-until operator $(\varphi_1\,\mathbf{W}\,\varphi_2) := (\varphi_1\,\mathbf{U}\,\varphi_2) \vee \mathbf{G}(\varphi_1)$.
In the following, we introduce our approach to specify languages such that a given verification property $\varphi \in \mathrm{LTL}$ is violated, but in such a way that all counterexamples that witness this violation have a guaranteed minimal length.

Guaranteeing deep LTL counterexamples
In this section we show how to construct $(m, n]$-hard verification tasks. Here, hardness is based on an integer interval $(m, n]$ of prefix lengths with the following meaning: looking at prefixes of words $w \in L$ of length at most $m$ does not suffice to explain the property violation, whereas there exists a violating prefix of length at most $n$. In other words, every prefix of length less than or equal to $m$ can be extended to a word that satisfies $\varphi$, but this is not the case for all prefixes of length up to $n$. We aim for verification tasks $(L, \varphi)$ such that
1. $L \subseteq \Sigma^\omega$,
2. $\varphi$ is an LTL formula, and
3. $(L, \varphi)$ is $(m, n]$-hard.
In the following, we only synthesize reactive programs and LTL properties for reasoning about non-terminating paths. Our construction then works by constructing a maximal sublanguage $L' \subseteq L$ that is $(m, n]$-hard w.r.t. $\varphi$ (see Sec. 3.3 for our realization based on Büchi automata). In general, $L'$ may well be empty, a phenomenon that we deal with in a heuristic fashion.
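The following notion of violating prefix is important:
Definition 6 Let $\varphi \in$ LTL and let $w \in \Sigma^*$ be a finite word. Then $w$ violates $\varphi$ iff the following holds:
$$\forall w' \in \Sigma^\omega:\ w\,w' \not\models \varphi$$
An infinite word $w \in \Sigma^\omega$ $k$-violates $\varphi$ iff its prefix $w^{\le k}$ violates $\varphi$. A language $L \subseteq \Sigma^\omega$ $k$-violates $\varphi$ iff there exists a word $w \in L$ such that $w$ $k$-violates $\varphi$.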
Intuitively speaking, a finite word violates $\varphi$ if it cannot be extended to a word that satisfies $\varphi$. The following lemma follows straightforwardly:
Lemma 1 Let $k, k' \in \mathbb{N}$ with $k \le k'$. If a language $L \subseteq \Sigma^\omega$ $k$-violates $\varphi$, then $L$ also $k'$-violates $\varphi$.
This monotonicity property allows us to specify $(m, n]$-hardness simply based on the boundaries of this integer interval.
Definition 7 (Hardness) A language $L \subseteq \Sigma^\omega$ is called $(m, n]$-hard w.r.t. $\varphi$ iff the following hold:
1. $L$ does not $m$-violate $\varphi$.
2. $L$ $n$-violates $\varphi$.
Based on this hardness definition, we can deduce a constructive approach to generate the maximal sub-language of $L$ that is $(m, n]$-hard w.r.t. $\varphi$. We simply construct the maximal sub-language $L^m_\varphi$ of $L$ that does not $m$-violate $\varphi$ and then check whether or not $L^m_\varphi$ $n$-violates $\varphi$. If it does, $(L^m_\varphi, \varphi)$ is an $(m, n]$-hard verification task. Otherwise, we know that no $(m, n]$-hard verification task exists for $L$ and $\varphi$, and we continue by heuristically modifying the parameters. The remainder of this section is therefore dedicated to the construction of $L^m_\varphi$ and the subsequent check whether it $n$-violates $\varphi$.
Definition 8 (Violating Prefixes) Let $L \subseteq \Sigma^\omega$ and $k \in \mathbb{N}$. We denote the set of prefixes of $L$ with length at most $k$ by
$$L^{\le k} := \{ w^{\le k} \mid w \in L \}.$$
Given a $\varphi \in$ LTL, we call $VP(L, \varphi, k) := L^{\le k} \setminus [\![\varphi]\!]^{\le k}$ the violating prefixes of $\varphi$ in $L$ with length at most $k$.
The following lemma is straightforward to prove:
Lemma 2 Let $k \in \mathbb{N}$. Then $VP(L, \varphi, k)$ consists of all words $w \in L^{\le k}$ that violate $\varphi$.
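The notions of violation, $k$-violation, and $(m, n]$-hardness can be checked mechanically once $L$ and $[\![\varphi]\!]$ are given as explicit Büchi automata. The following sketch assumes such a representation (a dictionary with keys `init`, `acc`, and `delta`) and uses the observation that a finite word violates $\varphi$ exactly when no run of the automaton for $[\![\varphi]\!]$ over it ends in a state with non-empty language. All names and the example automata are hypothetical and chosen only for illustration.

```python
def successors(aut, state):
    return {t for (s, _), ts in aut["delta"].items() if s == state for t in ts}


def reach(aut, starts):
    """States reachable from starts via at least one transition."""
    seen, todo = set(), list(starts)
    while todo:
        for t in successors(aut, todo.pop()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen


def live_states(aut):
    """States from which an accepting state on a cycle is reachable."""
    states = {aut["init"]} | {s for (s, _) in aut["delta"]}
    states |= {t for ts in aut["delta"].values() for t in ts}
    good = {q for q in aut["acc"] if q in reach(aut, [q])}
    return {s for s in states if s in good or reach(aut, [s]) & good}


def step(aut, states, sym):
    return {t for s in states for t in aut["delta"].get((s, sym), ())}


def violates(aut_phi, word):
    """A finite word violates phi iff it cannot be extended to satisfy phi."""
    current = {aut_phi["init"]}
    for sym in word:
        current = step(aut_phi, current, sym)
    return not (current & live_states(aut_phi))


def k_violates(aut_l, aut_phi, k, alphabet):
    """Does some word of L(aut_l) have a length-k prefix violating phi?"""
    live_l, live_phi = live_states(aut_l), live_states(aut_phi)
    frontier = {(aut_l["init"], frozenset({aut_phi["init"]}))}
    for _ in range(k):
        frontier = {(t, frozenset(step(aut_phi, s_phi, a)))
                    for (s, s_phi) in frontier for a in alphabet
                    for t in aut_l["delta"].get((s, a), ())}
    # A pair is a witness if the prefix belongs to L and violates phi.
    return any(s in live_l and not (s_phi & live_phi)
               for (s, s_phi) in frontier)


def is_hard(aut_l, aut_phi, m, n, alphabet):
    return (not k_violates(aut_l, aut_phi, m, alphabet)
            and k_violates(aut_l, aut_phi, n, alphabet))


# Hypothetical example: phi = G(not b), L = a a b Sigma^omega over {a, b}.
SIGMA = {"a", "b"}
phi = {"init": "p0", "acc": {"p0"}, "delta": {("p0", "a"): {"p0"}}}
lang = {"init": "q0", "acc": {"q3"},
        "delta": {("q0", "a"): {"q1"}, ("q1", "a"): {"q2"},
                  ("q2", "b"): {"q3"},
                  ("q3", "a"): {"q3"}, ("q3", "b"): {"q3"}}}
print(violates(phi, ["a", "a"]))          # False: a a a^omega satisfies phi
print(violates(phi, ["a", "a", "b"]))     # True
print(is_hard(lang, phi, 2, 3, SIGMA))    # True
```

By monotonicity, it suffices to inspect prefixes of length exactly $k$, which is why `k_violates` only examines the frontier after $k$ steps.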
Theorems 1 and 2 follow straightforwardly from Lemmas 1 and 2. Complementation of Büchi automata is a very expensive operation. The following theorem guarantees that this operation can be avoided and instead replaced by one that executes in quadratic time:
Theorem 3 $L^m_\varphi = L \cap ([\![\varphi]\!]^{\le m}\,\Sigma^\omega)$
Proof We show the two inclusions between $L^m_\varphi$ and $L' := L \cap ([\![\varphi]\!]^{\le m}\,\Sigma^\omega)$.
Every word $w \in L'$ lies in $[\![\varphi]\!]^{\le m}\,\Sigma^\omega$, which excludes that it $m$-violates $\varphi$. Thus we have, as desired, $L \cap ([\![\varphi]\!]^{\le m}\,\Sigma^\omega) \subseteq L^m_\varphi$. For the converse inclusion, let $w \in L^m_\varphi$. According to Def. 6, this means that there exists a word $w' \in \Sigma^\omega$ such that $w^{\le m}\,w'$ satisfies $\varphi$, which yields $w^{\le m} \in [\![\varphi]\!]^{\le m}$ and therefore in particular $w \in [\![\varphi]\!]^{\le m}\,\Sigma^\omega$. On the other hand, $L^m_\varphi \subseteq L$. Together this guarantees that $w \in L'$.
The next section presents the Büchi automaton-based realization of $L^m_\varphi$ in the way that it is used for our RERS benchmarks.

Realization based on Büchi automata
RERS' benchmark generation follows the idea of requirement-driven system generation. More precisely, the starting point for RERS benchmarks is a set of structural LTL properties $\Phi$, which are meant to impose realistic benchmark structures. Thus, the initial languages $L$ we consider in the rest of this paper are of the form $L = [\![\Phi]\!]$, and the goal is to construct $L^m_\varphi = [\![\Phi]\!]^m_\varphi$. According to Theorem 3, this means that we have to compute
$$[\![\Phi]\!] \cap ([\![\varphi]\!]^{\le m}\,\Sigma^\omega).$$
This can be done by means of well-known technology for Büchi automata as follows:
1. Compute $L = [\![\Phi]\!]$ and $[\![\varphi]\!]$. We use the Spot library [18] for this purpose. Please note that we need to constrain the construction of $L = [\![\Phi]\!]$ such that all transitions within the resulting Büchi automaton are labeled with a single atomic proposition. This can be accomplished by enforcing a corresponding invariant $\Omega$ in LTL (cf. [35]).
2. Concatenate the prefix tree of depth $m$ for $[\![\varphi]\!]$ with $\Sigma^\omega$ to obtain a Büchi automaton for $[\![\varphi]\!]^{\le m}\,\Sigma^\omega$. Essentially, this means adding an accepting $\Sigma^\omega$-loop at the end of each leaf of this prefix tree.
3. Compute the intersection of the two Büchi automata constructed in steps 1 and 2. This is again accomplished using the Spot library.
4. Heuristically minimize the Büchi automaton that results from step 3, again based on the Spot library. This is important for the scalability of later transformation steps in our overall approach, and it helps to obfuscate the tree expansion in step 2.
In order to be sure that $(L', \varphi)$ with $L' = L^m_\varphi$ is indeed an $(m, n]$-hard verification task, it remains to be shown that $L'$ $n$-violates $\varphi$ (cf. Def. 7). This can be done simply by means of an emptiness check for the set of words in $L'$ that $n$-violate $\varphi$. If it fails, i.e., if this set is non-empty, we are guaranteed to have a violating prefix that is longer than $m$ but shorter than or equal to $n$. Otherwise, we know that no $(m, n]$-hard verification task exists for $\Phi$ and $\varphi$, and we continue by heuristically modifying the parameters.
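Steps 2 and 3 and the final check can be sketched on explicit automata without any library support. The code below is a simplified illustration under several assumptions: it does not use Spot, it omits the minimization of step 4, it merges the prefix tree into a DAG via a subset construction, and all names (`restrict`, `k_violates`, the example automata) are hypothetical. Because every run enters an accepting $\Sigma^\omega$-leaf after exactly $m$ steps, the product only needs to mark states as accepting that are accepting in the automaton for $L$ and sit at a leaf.

```python
def successors(aut, state):
    return {t for (s, _), ts in aut["delta"].items() if s == state for t in ts}


def reach(aut, starts):
    """States reachable from starts via at least one transition."""
    seen, todo = set(), list(starts)
    while todo:
        for t in successors(aut, todo.pop()):
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return seen


def live_states(aut):
    """States from which an accepting state on a cycle is reachable."""
    states = {aut["init"]} | {s for (s, _) in aut["delta"]}
    states |= {t for ts in aut["delta"].values() for t in ts}
    good = {q for q in aut["acc"] if q in reach(aut, [q])}
    return {s for s in states if s in good or reach(aut, [s]) & good}


def step(aut, states, sym):
    return {t for s in states for t in aut["delta"].get((s, sym), ())}


def restrict(aut_l, aut_phi, m, alphabet):
    """Automaton for L(aut_l) intersected with [[phi]]^{<=m} Sigma^omega."""
    live_phi = live_states(aut_phi)
    init = (aut_l["init"], 0, frozenset({aut_phi["init"]} & live_phi))
    delta, acc, todo, seen = {}, set(), [init], {init}
    while todo:
        state = todo.pop()
        q, depth, node = state
        if depth == m and q in aut_l["acc"]:
            acc.add(state)
        for a in alphabet:
            if depth < m:  # inside the prefix tree: keep extendable prefixes
                tree_succ = (depth + 1,
                             frozenset(step(aut_phi, node, a) & live_phi))
            else:          # leaf: accepting Sigma^omega self-loop
                tree_succ = (depth, node)
            if not tree_succ[1]:
                continue   # this prefix already violates phi
            for t in aut_l["delta"].get((q, a), ()):
                target = (t,) + tree_succ
                delta.setdefault((state, a), set()).add(target)
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    return {"init": init, "acc": acc, "delta": delta}


def k_violates(aut_l, aut_phi, k, alphabet):
    """Does some word of L(aut_l) have a length-k prefix violating phi?"""
    live_l, live_phi = live_states(aut_l), live_states(aut_phi)
    frontier = {(aut_l["init"], frozenset({aut_phi["init"]}))}
    for _ in range(k):
        frontier = {(t, frozenset(step(aut_phi, s_phi, a)))
                    for (s, s_phi) in frontier for a in alphabet
                    for t in aut_l["delta"].get((s, a), ())}
    return any(s in live_l and not (s_phi & live_phi)
               for (s, s_phi) in frontier)


# Hypothetical example: L = Sigma^omega and phi = G(not b) over {a, b}.
SIGMA = {"a", "b"}
universe = {"init": "q", "acc": {"q"},
            "delta": {("q", "a"): {"q"}, ("q", "b"): {"q"}}}
phi = {"init": "p", "acc": {"p"}, "delta": {("p", "a"): {"p"}}}
res = restrict(universe, phi, 2, SIGMA)
print(k_violates(universe, phi, 1, SIGMA))  # True: prefix b violates phi
print(k_violates(res, phi, 2, SIGMA))       # False: only prefix a a remains
print(k_violates(res, phi, 3, SIGMA))       # True: prefix a a b violates phi
```

In this example the restricted language is $a\,a\,\Sigma^\omega$, which together with $\varphi$ forms a $(2, 3]$-hard verification task.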

Example
The following example illustrates each step of this realization. Here, $\Omega$ is the above-mentioned invariant that ensures that every Büchi automaton transition is labeled with exactly one atomic proposition (cf. [35]). In order to ease readability when this invariant is enforced, we abbreviate $n$ transitions labeled $b_1, \ldots, b_n$ that share their sources and targets by a single transition labeled "$b_1 | \ldots | b_n$". Figures 2 and 3 display the Büchi automata for $[\![\Phi]\!]$ and $[\![\varphi]\!]$, respectively, whereas Fig. 4 shows the Büchi automaton for the language that guarantees that there are no violating prefixes of length less than or equal to $m$ (cf. step 2 above). As $[\![\varphi]\!]$ and $[\![\varphi]\!]^{\le m}\,\Sigma^\omega$ do not feature a singleton invariant, they are an exception to our simplified representation. Spot [18], the library that we use for Büchi synthesis, uses a BDD representation for Büchi automaton transitions. Therefore, all labels within Figs. 3 and 4 have to be interpreted as BDDs and not in our simplified manner presented above. Note that because the automaton of Fig. 4 is afterwards intersected with the automaton of Fig. 2, the self loop at state 3 of the former does not need to specify the exact alphabet $\Sigma$. The Büchi automaton $B_{res}$ shown in Fig. 5 already specifies the desired language (cf. step 3 above), but it needs to be minimized to obfuscate the tree expansion step and to achieve scalability of subsequent transformations (cf. Fig. 6).

Experiments
To provide an impression of the scalability of our approach, we analyzed its execution time and the occurring numbers of states with $\Phi$ and $\varphi$ given as in the previous section, but with increasing lower hardness bounds. This means in particular that $B_\Phi$ (cf. Fig. 2) and $B_\varphi$ (cf. Fig. 3) are maintained during our experiments.
The first column of Table 3 shows the hardness bound m which was set to 2 during the discussion of the previous section, while columns two and three summarize the number of states of the resulting automata before minimization (corresponding to Fig. 5) and after heuristic minimization (corresponding to Fig. 6).The fourth column provides the wall-clock execution time for computing the final heuristically minimized Büchi automata as well as the proportion of execution time which is needed for that minimization.
As one can see, the numbers are strictly increasing. This seems to indicate that the corresponding languages are continuously changing, or more precisely, continuously strictly decreasing. This is, however, not guaranteed, because the four-step construction via Spot may well provide two different Büchi automata for the same language: there is no canonicity. Thus, to be sure that one has a valid $(m, n]$-hard verification task, one still has to check whether the languages for $n$ and $m$ are indeed different. In the considered cases, this could always be verified.
Our C++ implementation utilizes the Spot library [18] for synthesizing, modifying, and optimizing Büchi automata.The execution times in Table 3 are based on our implementation that was executed on a machine running Arch Linux 5.5.13-arch2-1 and featuring an AMD Ryzen 3950X processor with 32GiB of RAM.

Conclusion and perspective
We have summarized the history of the RERS challenge for the analysis and verification of reactive systems and its objectives in two parts.In the first part, its profile and intentions, its relation to other competitions, and, in particular, its evolution due to the feedback of participants were discussed.This comprised, in particular, the discussion of 'oddities' like over-tuning: some participants tweak their tools to the sometimes concretely known solutions of the competitions' benchmarks, which leads to scores that have little to do with the tools' performance in realistic scenarios.This way, even winning a competition does not necessarily need to be a recommendation for potential users.
The second part presents our latest development with regard to the over-tuning problem: the automatic synthesis of benchmark problems with tailored difficulty in a 'requirement-driven' fashion. More precisely, since the beginning, the starting point of the RERS benchmark generation has been a set of desired structural properties, here formulated in LTL, which are successively transformed via Büchi automata that characterize all satisfying executions to Modal Transition Systems and, with a few more steps, to code of various implementation languages (cf. Sec. 2). This way, RERS aims at benchmarks that closely resemble realistic code, but can be flexibly tailored in their degree of difficulty. The contribution of the second part is a way to tailor benchmarks according to the depths to which programs have to be investigated in order to find all errors. This approach gives benchmark designers a method to challenge contributors that try to perform well by excessive guessing, e.g., based on 'inappropriate' side knowledge. Combined with our traditional way of benchmark tailoring concerning the code/model size, the amount of arithmetic, and the used data structures as a measure for intricacy, RERS provides benchmark designers with a very powerful engine that we plan to make available open source.
It should be noted that the ideas presented in this paper are not only applicable to the generation of benchmarks that feature sequential programs. Rather, they can also be applied during the generation of parallel benchmark problems. This means that we can provide not only SV-COMP and similar competitions with tailored benchmark problems, but also competitions like MCC.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Fig. 6 $B_{res}$ after minimization

Table 1
RERS Challenges 2010-2020. Footnote a: Emphasized features brought or are expected to bring a permanent change to how benchmarks are generated for basic tracks

Table 2
Benchmark Synthesis in RERS.Property-driven generation (left) starts with a set of properties from which a model is synthesized.Model-driven generation (right) starts from a model.Mining and model checking or property extraction are used for generating challenge properties and expected verdicts.The model is expanded through semantics-preserving transformations.Code is generated from the model.The desired benchmark profile determines the extent of expansion and the language features used in the code

Table 3
State numbers and execution time (rounded up to two significant digits) of $B_{res}$ and minimized $B_{res}$ for the above example and different values of $m$