CVC4SY: Smart and Fast Term Enumeration for Syntax-Guided Synthesis

Abstract. We present CVC4SY, a syntax-guided synthesis (SyGuS) solver based on three bounded term enumeration strategies. The first encodes term enumeration as an extension of the quantifier-free theory of algebraic datatypes. The second is based on a highly optimized brute-force algorithm. The third combines elements of the others. Our implementation of the strategies within the satisfiability modulo theories (SMT) solver CVC4, together with a heuristic to choose between them, leads to significant improvements over state-of-the-art SyGuS solvers.


Introduction
Syntax-guided synthesis (SyGuS) [3] is a recent paradigm for program synthesis, successfully used for applications in formal verification and programming languages. Most SyGuS solvers perform counterexample-guided inductive synthesis (CEGIS) [16]: a refinement loop in which a learner proposes solutions, and a verifier, generally a satisfiability modulo theories (SMT) solver [8,9], checks them and provides counterexamples for failures. Generally, the learner enumerates some set of terms, while pruning spurious ones [17]. The simplicity and efficacy of enumerative SyGuS have made it the de facto approach for SyGuS, although alternatives exist for restricted fragments [4,14].
In previous work [14], we have shown how the SMT solver CVC4 [5] can itself act as an efficient synthesizer. This tool paper focuses on recent advances in the enumerative subsolver of CVC4, culminating in the current SyGuS solver CVC4SY. Figure 1 shows its main components. The term enumerator is parameterized by an enumeration strategy chosen before solving: CVC4SY S, whose constraint-based (smart) enumeration allows for numerous optimizations (Section 2); CVC4SY F, based on a new approach for (fast) enumerative synthesis (Section 3), which has significant advantages with respect to the enumerative solver CVC4SY S and other state-of-the-art approaches; and CVC4SY H, based on a hybrid approach combining smart and fast enumeration (Section 4). All strategies are fully integrated in CVC4, meaning they support inputs in many background theories, including arithmetic, bit-vectors, strings, and floating point. We evaluate these approaches on a large set of benchmarks (Section 5).
The Problem. A syntax-guided synthesis problem for a function $f$ in a background theory $T$ consists of a set of semantic restrictions, or specification, for $f$ given by a (second-order) $T$-formula of the form $\exists f.\, \varphi[f]$, and a set of syntactic restrictions on the solutions for $f$, typically expressed as a context-free grammar. An enumerative approach to this problem combines a term enumerator and a solution verifier for solving synthesis conjectures. The role of the term enumerator is to output a stream of terms $t_1, t_2, \ldots$ over some tuple $x$ of variables representing the inputs of $f$, where each $t_i[x]$ is a candidate solution. The role of the solution verifier is to check for each $t_i$ whether it is a solution for $f$ by determining if the negated conjecture $\neg\varphi[\lambda x.\, t_i]$ is unsatisfiable.
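The interplay between the term enumerator and the solution verifier can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CVC4SY implementation: the toy grammar (leaves `x`, `y`, `0`, `1` with binary `+` and `-`), the function names, and in particular the verifier are hypothetical. A real verifier would be an SMT solver checking the negated conjecture, whereas here it merely tests the candidate on a small finite domain.

```python
from itertools import product

def terms_of_size(k, leaves=("x", "y", "0", "1"), ops=("+", "-")):
    """Yield every term with exactly k operator applications (the term size)."""
    if k == 0:
        yield from leaves
        return
    for op in ops:
        for left_size in range(k):
            for left in terms_of_size(left_size, leaves, ops):
                for right in terms_of_size(k - 1 - left_size, leaves, ops):
                    yield (op, left, right)

def evaluate(term, x, y):
    """Evaluate a term of the toy grammar on concrete inputs."""
    if term == "x":
        return x
    if term == "y":
        return y
    if term in ("0", "1"):
        return int(term)
    op, left, right = term
    a, b = evaluate(left, x, y), evaluate(right, x, y)
    return a + b if op == "+" else a - b

def verify(term, spec, domain):
    """Stand-in verifier (assumption): bounded testing instead of an SMT query.

    Returns a counterexample point, or None if no violation is found.
    """
    for x, y in product(domain, repeat=2):
        if not spec(evaluate(term, x, y), x, y):
            return (x, y)
    return None

def synthesize(spec, domain=range(-3, 4), max_size=3):
    """Enumerate candidates by increasing size and return the first accepted one."""
    for k in range(max_size + 1):
        for term in terms_of_size(k):
            if verify(term, spec, domain) is None:
                return term
    return None
```

For instance, `synthesize(lambda out, x, y: out == x + y + 1)` returns a term of size 2, since no term of size 0 or 1 in this toy grammar meets the specification on the tested domain.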
Bounded term generation considers terms based on an ordering such as term size (the number of non-nullary symbols in a term). For each $k = 0, 1, 2, \ldots$, the term enumerator outputs a finite set $S_k$ of terms, each of size at most $k$. Bounded term generation in CVC4SY is complete in the sense that, for any $k$, if $f$ has a solution of size at most $k$, then at least one of the terms in $S_k$ is a solution for $f$. The effectiveness of an approach for (complete) bounded term generation can be evaluated based on two criteria: (i) the number of terms it generates and (ii) the rate at which it generates them.
We follow two approaches for enumerative SyGuS in CVC4SY, each optimized for one of the criteria above: a smart approach and a fast one. The first aims to generate, reasonably quickly, the smallest set of terms that maintains completeness, whereas the second aims to generate terms as quickly as possible.
Technical Preliminaries. As we showed in previous work [14], syntactic restrictions can be conveniently represented as a set of (algebraic) datatypes, for which some SMT solvers have dedicated decision procedures [7,13]. For instance, given a function $f : (x : \mathsf{Int}) \times (y : \mathsf{Int}) \to \mathsf{Int}$ and a context-free grammar $R$ specifying what integer ($I$) and Boolean ($B$) terms can appear in candidate solutions for $f$, our SyGuS solver generates mutually recursive datatypes $I$ and $B$ (Equations 3 and 4). Each datatype constructor corresponds to a production rule of $R$, e.g. $\mathsf{plus}$ corresponds to the rule $I ::= I + I$. A datatype term such as $\mathsf{plus}(x, y)$ represents the arithmetic term $x + y$. We will use these datatypes as a running example. For a datatype term $t$, we write $\mathsf{is}_C(t)$ to denote the discriminator predicate that is satisfied exactly when $t$ is interpreted as a datatype whose top constructor is $C$. We write $\mathsf{sel}^{\tau}_{n}(t)$ to denote a shared selector [15] applied to $t$, interpreted as the $n$-th child of $t$ with type $\tau$ if one exists, and interpreted as an arbitrary element of $\tau$ otherwise. A term consisting of zero or more consecutive nested applications of shared selectors applied to a term $t$ is a shared selector chain (for $t$).
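To make discriminators and shared selectors concrete, the following Python sketch models datatype terms as tuples whose first component is the constructor name. It is only an illustration. The constructors of $I$ follow the running example, but the Boolean constructors `geq`, `eq`, and `and` are assumptions made here, and the `default` argument stands in for the "arbitrary element of $\tau$" that a shared selector denotes when the requested child does not exist.

```python
# Hypothetical signature: constructor name -> (result type, argument types).
SIGNATURE = {
    "x": ("I", ()), "y": ("I", ()), "0": ("I", ()), "1": ("I", ()),
    "plus": ("I", ("I", "I")),
    "minus": ("I", ("I", "I")),
    "ite": ("I", ("B", "I", "I")),
    # The Boolean constructors below are assumed for illustration.
    "geq": ("B", ("I", "I")),
    "eq": ("B", ("I", "I")),
    "and": ("B", ("B", "B")),
}

def is_c(constructor, t):
    """Discriminator is_C(t): holds exactly when the top constructor of t is C."""
    return t[0] == constructor

def sel(tau, n, t, default=None):
    """Shared selector sel^tau_n(t): the n-th child (1-based) of t having type tau.

    If t has fewer than n children of type tau, return `default`, which plays
    the role of an arbitrary element of tau.
    """
    arg_types = SIGNATURE[t[0]][1]
    matching = [child for child, ty in zip(t[1:], arg_types) if ty == tau]
    return matching[n - 1] if n <= len(matching) else default
```

With `t = ("ite", ("geq", ("x",), ("0",)), ("x",), ("plus", ("y",), ("1",)))`, the call `sel("I", 1, t)` yields the true branch and `sel("B", 1, t)` yields the condition, even though the condition is the first child overall. This is what distinguishes shared selectors from ordinary positional selectors.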

Smart Enumerative SyGuS
Our smart enumerative SyGuS approach, CVC4SY S, is based on finding solutions for an evolving set of constraints in an extension of the quantifier-free fragment of algebraic datatypes. These constraints are constructed to rule out many redundant solutions while not overconstraining the problem, which could cause actual solutions to be missed.
In detail, candidate solutions for the function $f : \tau_1 \to \tau_2$ to be synthesized are constructed by maintaining a set of constraints $F$, initially empty, for a first-order variable $d$ ranging over the datatype representing $\tau_2$. For example, consider again the function $f$ with the syntactic restrictions expressed by the datatypes in Equations 3 and 4. If the term generator finds a model for $F$, it provides to the solution verifier the integer term which corresponds to the value of $d$ in the model; for example, it provides $x + 1$ when $d$ is interpreted as $\mathsf{plus}(x, 1)$. In turn, if the solution verifier finds that $x + 1$ is not a solution, it provides the blocking constraint $\neg\mathsf{is}_{\mathsf{plus}}(d) \lor \neg\mathsf{is}_{x}(\mathsf{sel}^{I}_{1}(d)) \lor \neg\mathsf{is}_{1}(\mathsf{sel}^{I}_{2}(d))$, i.e., the datatype constraint that rules out the current value for $d$, which is then added to $F$. This is a syntactic constraint on future candidate solutions from the term generator. Its atoms are discriminators applied to shared selector chains.
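The shape of such a blocking constraint can be derived mechanically from the datatype value of $d$. The Python sketch below is illustrative only: the tuple encoding of terms, the `ARG_TYPES` table (including the assumed Boolean constructors `geq` and `eq`), and the textual rendering of the clause are choices made here, not CVC4SY internals.

```python
# Argument types per constructor; nullary constructors are simply absent.
# The Boolean constructors are assumptions made for this illustration.
ARG_TYPES = {
    "plus": ("I", "I"),
    "minus": ("I", "I"),
    "ite": ("B", "I", "I"),
    "geq": ("I", "I"),
    "eq": ("I", "I"),
}

def blocking_atoms(value, chain="d"):
    """Return (constructor, selector chain) pairs describing `value` exactly."""
    constructor, *children = value
    atoms = [(constructor, chain)]
    index_for_type = {}
    for child, ty in zip(children, ARG_TYPES.get(constructor, ())):
        # Shared selectors are indexed per child type, not per position.
        index_for_type[ty] = index_for_type.get(ty, 0) + 1
        atoms += blocking_atoms(child, f"sel_{ty}_{index_for_type[ty]}({chain})")
    return atoms

def blocking_clause(value):
    """Disjunction of negated discriminators that rules out exactly `value`."""
    return " or ".join(f"not is_{c}({chain})" for c, chain in blocking_atoms(value))
```

For the value `("plus", ("x",), ("1",))`, `blocking_clause` returns `not is_plus(d) or not is_x(sel_I_1(d)) or not is_1(sel_I_2(d))`, which mirrors the constraint given above.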
CVC4SY S uses a number of optimization techniques in addition to the basic loop above, which we describe in the remainder of this section. These techniques produce blocking constraints via the lemmas-on-demand paradigm [6] that eagerly rule out spurious candidates, prior to the solution verification step. Additionally, whenever possible, it strengthens blocking constraints via novel generalization techniques, with the effect of ruling out larger classes of candidates.

Blocking via Theory Rewriting with Structural Generalization
As we describe in previous work [14], the enumerative solver of CVC4 uses its rewriter as an oracle for discovering when candidate solutions are redundant. The motivation is that for any two equivalent terms $t$ and $s$, only one of them needs to be checked with the solution verifier, since either both $t$ and $s$ are solutions to the synthesis conjecture or neither is. Given a term $t$, we write $t{\downarrow}$ to denote its rewritten form. Note that it is possible for equivalent terms not to have the same rewritten form. This is a consequence of the trade-offs in the implementation of CVC4's rewriter, which must balance efficiency and completeness.
As an example, suppose that the term enumerator previously generated $x + y$ and that $d$'s current value is the datatype term representing $y + x$, where $(x + y){\downarrow} = (y + x){\downarrow}$. We first generate a blocking constraint template $R[z]$ of the form $\neg\mathsf{is}_{\mathsf{plus}}(z) \lor \neg\mathsf{is}_{y}(\mathsf{sel}^{I}_{1}(z)) \lor \neg\mathsf{is}_{x}(\mathsf{sel}^{I}_{2}(z))$, where $z$ is a fresh variable. This template is subsequently instantiated with $z \mapsto u$ for any shared selector chain $u$ of type $I$ that currently (or later) appears in $F$, starting with $d$ itself. This has the effect of ruling out all candidate solutions that have $y + x$ as a subterm, which is justified by the fact that each such term is equivalent to one in which all occurrences of $y + x$ are replaced by $x + y$.
We employ a refinement of this technique, which we call theory rewriting with structural generalization, which searches for and then blocks only the minimal skeleton of the term under test that is sufficient for determining its rewritten form. For example, consider the if-then-else term $t = \mathsf{ite}(x \approx 0 \land y \ge 0,\ 0,\ x)$. This term is equivalent to $x$, regardless of the value of predicate $y \ge 0$. This can be confirmed by the rewriter by computing that $\mathsf{ite}(x \approx 0 \land w,\ 0,\ x){\downarrow} = x$, where $w$ is a fresh Boolean variable. Then, instead of generating a constraint that blocks only (the datatype value corresponding to) $t$, we generate a stronger constraint that does not depend on the subterm $y \ge 0$. In other words, this blocking constraint rules out all candidate solutions that contain the subterm $\mathsf{ite}(x \approx 0 \land w,\ 0,\ x)$, for any term $w$. We compute these generalizations using a recursive algorithm that iteratively replaces each subterm of the current candidate with a fresh variable, and checks whether its rewritten form remains the same.
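The recursive generalization step can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `rewrite` function is a toy rewriter with only two rules ($t + 0 \to t$ and $\mathsf{ite}(c, a, a) \to a$), far weaker than CVC4's, so the example term differs from the one in the text; fresh variables are rendered as strings `?w0`, `?w1`, and so on.

```python
def rewrite(t):
    """Toy rewriter (assumption): t + 0 -> t, 0 + t -> t, ite(c, a, a) -> a."""
    if isinstance(t, str):
        return t
    op, *args = t
    args = [rewrite(a) for a in args]
    if op == "plus" and args[1] == "0":
        return args[0]
    if op == "plus" and args[0] == "0":
        return args[1]
    if op == "ite" and args[1] == args[2]:
        return args[1]
    return (op, *args)

def subterm(t, path):
    """Follow a path of child indices into t."""
    for i in path:
        t = t[i]
    return t

def replace_at(t, path, new):
    """Return a copy of t with the subterm at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return t[:i] + (replace_at(t[i], path[1:], new),) + t[i + 1:]

def generalize(t):
    """Replace subterms by fresh variables while the rewritten form is unchanged."""
    target = rewrite(t)
    fresh = [0]

    def visit(current, path):
        trial = replace_at(current, path, f"?w{fresh[0]}")
        if path and rewrite(trial) == target:
            fresh[0] += 1
            return trial  # the whole subterm is irrelevant to the rewritten form
        sub = subterm(current, path)
        if isinstance(sub, tuple):
            for i in range(1, len(sub)):
                current = visit(current, path + (i,))
        return current

    return visit(t, ())
```

Applied to `("ite", ("geq", "x", "y"), ("plus", "x", "0"), "x")`, which the toy rewriter reduces to `x`, the sketch abstracts the condition and keeps both branches, yielding the skeleton `("ite", "?w0", ("plus", "x", "0"), "x")`.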
Blocking via CEGIS with Structural Generalization
Synthesis solvers based on CEGIS maintain a list of refinement points that witness the infeasibility of previous candidate solutions. That is, given a synthesis conjecture $\exists f.\, \forall x.\, \varphi[f, x]$, the solver maintains a growing list $p_1, \ldots, p_n$ of values for $x$ that witness the infeasibility of previous candidates $u_1, \ldots, u_n$ for $f$. Then, when a new candidate $u$ is generated, we first check whether $\varphi[u, p_i]$ is false for some $i \le n$. When a candidate $u$ fails to satisfy $\varphi[u, p_i]$, CVC4SY S further applies a form of generalization analogous to the structural generalization described above. We call this CEGIS with structural generalization, where the goal is to find the minimal skeleton of $u$ that also fails to satisfy some refinement point.
For example, suppose $f$ is the function to synthesize, $\varphi$ includes the constraint $f(x, y) \le x - 1$, and $p_1 = (3, 3)$ is a refinement point. Then, the candidate term $u[x, y] = \mathsf{ite}(x \ge 0,\ x,\ y + 1)$ will be discarded, because $\mathsf{ite}(3 \ge 0,\ 3,\ 4) \not\le 2$. Notice, however, that any candidate $u' = \mathsf{ite}(x \ge 0,\ x,\ w)$ is falsified by $p_1$, regardless of what $w$ is, since $u'[3, 3] \le 2$ is equivalent to $3 \le 2$. This indicates that we can block all $\mathsf{ite}$ candidate terms with condition $x \ge 0$ and true branch $x$. We can express this constraint in CVC4SY S by dropping the disjuncts that relate to the false branch of the $\mathsf{ite}$ term. This form of blocking is particularly useful when synthesizing multiple functions $(f_1, \ldots, f_n)$, since it is often the case that a candidate for a single $f_i$ is already sufficient to falsify the specification, regardless of what the candidates for the other functions are.
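The example above can be reproduced with a short Python sketch. It is a simplified illustration rather than the procedure used in CVC4SY S: the check that a skeleton is falsified "regardless of what $w$ is" is approximated here by partial evaluation, where a hole (`"?"`) makes the result unknown unless the evaluation never reaches it. The term encoding and all function names are hypothetical.

```python
HOLE = "?"

def peval(t, env):
    """Partially evaluate t at a point; return None if the result needs a hole."""
    if isinstance(t, int):
        return t
    if isinstance(t, str):
        return None if t == HOLE else env[t]
    op, *args = t
    if op == "ite":
        cond = peval(args[0], env)
        if cond is None:
            return None
        # Only the selected branch matters, so the other one may be a hole.
        return peval(args[1] if cond else args[2], env)
    a, b = (peval(arg, env) for arg in args)
    if a is None or b is None:
        return None
    return {"plus": a + b, "minus": a - b, "geq": a >= b, "eq": a == b}[op]

def falsified(t, spec, env):
    """True if t violates spec at the point env no matter how holes are filled."""
    value = peval(t, env)
    return value is not None and not spec(value, env)

def replace_at(t, path, new):
    if not path:
        return new
    i = path[0]
    return t[:i] + (replace_at(t[i], path[1:], new),) + t[i + 1:]

def generalize(t, spec, env):
    """Replace subterms by holes as long as the point env still falsifies t."""
    def visit(current, path):
        trial = replace_at(current, path, HOLE)
        if path and falsified(trial, spec, env):
            return trial
        sub = current
        for i in path:
            sub = sub[i]
        if isinstance(sub, tuple):
            for i in range(1, len(sub)):
                current = visit(current, path + (i,))
        return current
    return visit(t, ())
```

With `spec = lambda value, env: value <= env["x"] - 1` and the point `{"x": 3, "y": 3}`, the candidate `("ite", ("geq", "x", 0), "x", ("plus", "y", 1))` generalizes to `("ite", ("geq", "x", 0), "x", "?")`, that is, the false branch is dropped from the blocking constraint.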
Evaluation Unfolding
This technique uses evaluation functions to encode the relationship between the datatype terms assigned to $d$ and their analogs in the theory $T$. For example, the evaluation function for the datatype $I$ defined in (3) is a function $E_I : I \times \mathsf{Int} \times \mathsf{Int} \to \mathsf{Int}$ defined axiomatically so that $E_I(d, m, n)$ denotes the result of evaluating $d$ by interpreting any occurrences of $x$ and $y$ in $d$ respectively as $m$ and $n$ and interpreting the other constructors as the corresponding arithmetic/Boolean operators, e.g. $E_I(\mathsf{minus}(x, y), 5, 3)$ is interpreted as 2. When a refinement point $c$ is generated, we add a constraint requiring that the evaluation of $d$ at $c$ must satisfy the specification. For example, for conjecture $\exists f.\, \forall x.\, f(x + 1, x) \le 0$ and refinement point $x \mapsto 1$, we add the constraint $E_I(d, 2, 1) \le 0$. Then, when a literal $\mathsf{is}_C(t)$ is asserted for a term $t$ of type $I$, we can add a constraint corresponding to the one-step unfolding of the evaluation of $t$. Specifically, when $\mathsf{is}_{\mathsf{ite}}(d)$ is asserted, we generate the constraint $E_I(d, 2, 1) \approx \mathsf{ite}(E_B(\mathsf{sel}^{B}_{1}(d), 2, 1),\ E_I(\mathsf{sel}^{I}_{1}(d), 2, 1),\ E_I(\mathsf{sel}^{I}_{2}(d), 2, 1))$, indicating that the evaluation of $d$ on point $(2, 1)$ indeed behaves like an $\mathsf{ite}$ term when $d$ has top symbol $\mathsf{ite}$. Our implementation adds these constraints for all terms $t$ whose top symbols correspond to $\mathsf{ite}$ or Boolean connectives. For terms $t$ whose top symbol is any of the other operators, we add constraints corresponding to the total evaluation of $t$ when the value of $t$ is fully determined, for example, $t \approx \mathsf{plus}(x, y) \Rightarrow E_I(t, 2, 1) \approx 3$. Notice that this constraint with $t = d$, along with the refinement constraint $E_I(d, 2, 1) \le 0$, suffices to show that $d$ cannot be $\mathsf{plus}(x, y)$.
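The evaluation function and the refinement constraint from this example can be mimicked by a direct recursive evaluator. The Python sketch below is only an executable reading of the definitions: in CVC4SY the function $E_I$ is axiomatized inside the solver and unfolded lazily, whereas here it is computed eagerly, and the term encoding and the helper `satisfies_refinement` are assumptions introduced for illustration.

```python
def evaluate(t, m, n):
    """E(t, m, n): evaluate datatype term t with x interpreted as m and y as n."""
    if t == "x":
        return m
    if t == "y":
        return n
    if isinstance(t, int):
        return t
    op, *args = t
    if op == "ite":
        cond, then_branch, else_branch = args
        # One-step unfolding: E(ite(c, a, b)) = ite(E(c), E(a), E(b)).
        if evaluate(cond, m, n):
            return evaluate(then_branch, m, n)
        return evaluate(else_branch, m, n)
    a, b = (evaluate(arg, m, n) for arg in args)
    if op == "plus":
        return a + b
    if op == "minus":
        return a - b
    if op == "geq":
        return a >= b
    if op == "eq":
        return a == b
    raise ValueError(f"unknown constructor: {op}")

def satisfies_refinement(d, x_value):
    """Refinement constraint for the conjecture f(x + 1, x) <= 0 at x = x_value."""
    return evaluate(d, x_value + 1, x_value) <= 0
```

At the refinement point $x \mapsto 1$, `satisfies_refinement(("plus", "x", "y"), 1)` is false because the term evaluates to 3 at $(2, 1)$, so this value of $d$ is excluded, whereas `("minus", "y", "x")` passes the check.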

Fast Enumerative SyGuS
The techniques in the previous section prune the search space so that often, only a small subset of the entire possible set of terms is considered for a given term size bound. The main bottleneck, however, is managing the large number of blocking constraints generated. Moreover, the benefits of this approach are limited when the grammar or specification does not admit opportunities for generalization.
For this reason, we have also developed CVC4SY F, which, in the spirit of other SyGuS solvers (notably ESOLVER [17]), relies on a principled brute-force approach for term generation. In contrast to other solvers, however, which are built as layers on top of the core SMT reasoner, CVC4SY F is fully integrated as a subsolver of CVC4, so communication with other components has almost no overhead. This technique, fast enumerative synthesis, does not use constraint solving to generate new terms. As a result, the majority of optimizations from Section 2 are incompatible with it.
Algorithm. To generate terms up to a given size $k$, we maintain a set $S^{k}_{\tau}$ of terms of type $\tau$ and size $k$ for each datatype $\tau$ corresponding to a non-terminal symbol of our input grammar $R$. First, we compute for each such $\tau$ the set $C_{\tau}$ of its constructor classes, the classes of an equivalence relation over the constructors of $\tau$ that groups them by their type. For example, the constructor classes for $I$ are $\{x, y, 0, 1\}$, $\{\mathsf{plus}, \mathsf{minus}\}$ and $\{\mathsf{ite}\}$. Then, we use the following procedure for generating all terms of size $k$ for type $\tau$:
The recursive procedure FASTENUM($\tau$, $k$) populates the set $S^{k}_{\tau}$ of all terms of type $\tau$ with size $k$. These sets are cached globally. We incorporate an optimization that only adds terms $C(t_1, \ldots, t_n)$ to $S^{k}_{\tau}$ whose corresponding terms in the theory $T$ are unique up to rewriting. This mimics the effect of blocking via theory rewriting as described in Section 2. For example, $\mathsf{plus}(y, x)$ is not added to $S^{1}_{I}$ if that set already contains $\mathsf{plus}(x, y)$, noting that $(x + y){\downarrow} = (y + x){\downarrow}$. By construction of $S^{k}_{\tau}$ for $k \ge 1$, this has the cascading effect of excluding all terms having $y + x$ as a subterm.
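A procedure of this kind might look as follows in Python. This is a sketch under several assumptions, not the FASTENUM procedure itself: the Boolean constructor class (`geq`, `eq`) is assumed, the rewriter is a toy that only sorts the arguments of commutative constructors, and Python's `lru_cache` plays the role of the global cache. The essential ideas it illustrates are iterating over constructor classes, splitting the size budget among the children, and discarding terms whose rewritten form has been seen.

```python
from functools import lru_cache
from itertools import product

# Constructor classes per type: (argument types, constructors of that type).
CLASSES = {
    "I": [((), ("x", "y", "0", "1")),
          (("I", "I"), ("plus", "minus")),
          (("B", "I", "I"), ("ite",))],
    "B": [(("I", "I"), ("geq", "eq"))],  # assumed Boolean constructors
}

def rewrite(t):
    """Toy rewriter (assumption): order the arguments of commutative operators."""
    if isinstance(t, str):
        return t
    op, *args = t
    args = [rewrite(a) for a in args]
    if op in ("plus", "eq"):
        args.sort(key=repr)
    return (op, *args)

def compositions(total, parts):
    """Yield all tuples of `parts` natural numbers that sum to `total`."""
    if parts == 0:
        if total == 0:
            yield ()
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first, *rest)

@lru_cache(maxsize=None)
def fast_enum(tau, k):
    """Terms of type tau and size exactly k, unique up to (toy) rewriting."""
    terms, seen = [], set()
    for arg_types, constructors in CLASSES[tau]:
        n = len(arg_types)
        budget = k - (1 if n > 0 else 0)  # a non-nullary symbol costs one unit
        if budget < 0:
            continue
        for sizes in compositions(budget, n):
            child_sets = [fast_enum(ty, s) for ty, s in zip(arg_types, sizes)]
            for children in product(*child_sets):
                for c in constructors:
                    t = (c, *children) if n else c
                    normal = rewrite(t)
                    if normal not in seen:
                        seen.add(normal)
                        terms.append(t)
    return tuple(terms)
```

Because children are drawn only from previously computed sets, a term dropped at size 1, such as `("plus", "y", "x")`, never reappears as a subterm at larger sizes, which is the cascading effect noted above.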
We observe that theory rewriting with structural generalization cannot be easily incorporated into this scheme since it requires the use of a constraint solver, something that the above algorithm seeks to avoid.

Hybrid Approach: Variable-Agnostic Enumerative SyGuS
We follow a third approach, in solver CVC4SY H, that combines elements of the previous approaches. The idea is to use the (smart) approach from Section 2 to generate terms, but then generate multiple candidate solutions from each term using a fast subprocedure we call a concretizer. We implement an instance of this scheme, which we call variable-agnostic term generation, that produces only terms that are unique modulo alpha-equivalence. In our running example, when a term $t$ such as $x + 1$ is produced, the concretizer produces all terms generated by the grammar $R$ that are alpha-equivalent to $t$, namely, $\{x + 1,\ y + 1\}$ in this case. The advantage of this approach is that CVC4SY H can block any term whose variables are not canonically ordered; that is, assuming for instance that $x < y$, it may block terms like $1 - y$ and $y + y$, noting they are alpha-equivalent to $1 - x$ and $x + x$, respectively. To implement this blocking scheme, we introduce unary Boolean predicates $\mathsf{pre}_x$ and $\mathsf{post}_x$ for each variable $x$ in our grammar, where $\mathsf{pre}_x$ (resp., $\mathsf{post}_x$) holds for $t$ if and only if variable $x$ occurs in a depth-first left-to-right traversal of our candidate term before (resp., after) traversing to the position indicated by the selector chain $t$. We encode the semantics of these predicates based on the arguments of constructors in our signature, e.g. $\mathsf{is}_{\mathsf{plus}}(z) \Rightarrow (\mathsf{pre}_x(z) \approx \mathsf{pre}_x(\mathsf{sel}^{I}_{1}(z)) \land \mathsf{post}_x(\mathsf{sel}^{I}_{2}(z)) \approx \mathsf{post}_x(z))$. We then assert that $\mathsf{pre}_x$ and $\mathsf{pre}_y$ are false for our top-level variable $d$, and require $\mathsf{is}_y(z) \Rightarrow \mathsf{pre}_x(z)$ for all $z$, stating that $x$ must come before $y$ in the traversal of any generated term.
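The effect of the canonical variable ordering and of the concretizer can be illustrated directly on terms, without the $\mathsf{pre}$/$\mathsf{post}$ encoding. The Python sketch below is a simplification: it checks canonicity by an explicit depth-first traversal instead of solver constraints, and it concretizes by applying every injective renaming of the variables that occur. The names `is_canonical` and `concretize` and the two-variable set are assumptions for this example.

```python
from itertools import permutations

VARIABLES = ("x", "y")  # assumed canonical order: x before y

def variables_in_order(t, acc=None):
    """Variables of t in order of first occurrence (depth-first, left-to-right)."""
    acc = [] if acc is None else acc
    if isinstance(t, str):
        if t in VARIABLES and t not in acc:
            acc.append(t)
    else:
        for child in t[1:]:
            variables_in_order(child, acc)
    return acc

def is_canonical(t):
    """True if the variables first occur in the canonical order x, y, ..."""
    seen = variables_in_order(t)
    return tuple(seen) == VARIABLES[:len(seen)]

def rename(t, mapping):
    """Apply a variable renaming to t."""
    if isinstance(t, str):
        return mapping.get(t, t)
    return (t[0], *(rename(child, mapping) for child in t[1:]))

def concretize(t):
    """All terms alpha-equivalent to t over VARIABLES."""
    used = variables_in_order(t)
    return [rename(t, dict(zip(used, image)))
            for image in permutations(VARIABLES, len(used))]
```

Here `concretize(("plus", "x", "1"))` produces the two terms for $x + 1$ and $y + 1$, while `is_canonical` rejects `("minus", "1", "y")` and `("plus", "y", "y")`, matching the terms said to be blocked above.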
This technique is useful for grammars with many variables, such as grammars in invariant synthesis problems, where the number of terms of small size is prohibitively large. Blocking based on theory rewriting (with generalization) from Section 2 is compatible with this technique and is used in CVC4SY H. However, the other optimizations are disabled, since they prune solutions in a way that is not agnostic to variables.

Evaluation
We evaluated the above techniques in CVC4SY on four benchmark sets: invariant synthesis benchmarks from the verification of Lustre [11] models; a set from work on synthesizing invertibility conditions for bit-vector operators [12] (IC-BV); a set of bit-vector invariant synthesis problems [2] (CegisT); and the SyGuS-COMP 2018 [1] benchmarks from five tracks: assorted problems (General), conditional linear arithmetic problems (CLIA), invariant synthesis problems (INV), and programming-by-examples problems [10] with a set over bit-vectors (PBE-BV) and another over strings (PBE-Str). We also considered separately the CrCi subset from General, which corresponds to cryptographic circuit synthesis. We ran our experiments on a cluster equipped with Intel E5-2637 v4 CPUs running Ubuntu 16.04, providing one core, 1800 seconds, and 8GB RAM for each job. Results are summarized in Table 1 and Figure 2. We denote the strategies from Sections 2, 3, and 4 by s, f and h, respectively (smart, fast, and hybrid); disabling the optimizations from Section 2 is marked by "-" and the suffixes r (rewriting), rg (rewriting with structural generalization), cg (CEGIS with structural generalization), and eu (evaluation unfolding). We also evaluated two meta-strategies of CVC4SY: a and a+si. The auto strategy a picks a strategy based on the properties of the problem: f for PBE problems and for problems without the Boolean type or the ite operator in their grammar and s otherwise. Strategy a+si uses the single-invocation solver [14] on problems that are amenable to quantifier elimination and a otherwise. We use the state-of-the-art SyGuS solver EUSOLVER [4] (EUS) as a baseline, but only for SyGuS-COMP benchmarks due to limitations in its parser.
Overall, strategy s excels on more challenging benchmark sets such as Lustre and Gen-CrCi, while strategy f excels on the majority of the others. The gains for f are especially significant on PBE problems, where it outperforms both s and EUS by several orders of magnitude. Such gains are significant given that CVC4 won this track at SyGuS-COMP 2018 by employing s alone, and a variant of EUS won it in 2017. This result can be explained as a consequence of two factors. First, the string and bit-vector grammars contain many operators with the same type, making the constructor class optimization of the f algorithm very effective. Second, although not described in this paper, all solvers in our evaluation use divide-and-conquer algorithms for PBE problems [4], which are not compatible with the optimizations cg and eu. The most important optimization for all CVC4SY strategies and with all benchmark sets is r. The optimization eu is especially effective when grammars contain ite and Boolean connectives, such as those in the Lustre set and in some subsets of General, on which we can see the biggest gains of s with respect to s-eu; cg is more helpful for IC-BV, with a few harder benchmarks only solved due to this technique. The first scatter plot in Figure 2 shows the advantage of h over s on Lustre, a benchmark set containing invariant synthesis problems with dozens of variables. We remark that this configuration excels at quickly finding small solutions for problems with many variables, although it solves fewer problems overall. The second scatter plot shows that while s takes significantly longer on easy problems, it outperforms f in the long run. The last two plots show that f significantly outperforms the state of the art on PBE benchmarks.
For all benchmark sets, the auto strategy a chooses the best enumerative strategy of CVC4SY with only a few exceptions, and hence it is the default configuration of CVC4SY. Due to specialized synthesis techniques [14,4], both a+si and EUS outperform the purely enumerative strategies of CVC4. This is reflected in the cactus plot on the commonly supported benchmark sets, where a and f solve more benchmarks than EUS for lower times but then EUS solves more benchmarks in the end. For a+si, the cactus plot shows that it outperforms EUS significantly. Nevertheless, we remark that a+si is able to solve only 393 (16%) of the overall benchmarks using only single invocation techniques. Hence, we conclude that both smart and fast enumerative strategies are critical subcomponents in our approach to syntax-guided synthesis.

Fig. 2. Cactus plot on commonly supported benchmark sets. The first scatter plot is for the Lustre set, the second for the Gen-CrCi set, and the last two for the 862 benchmarks from the PBE sets.

Table 1. Summary of the number of problems solved per benchmark set. Best results are in bold.