Efficient Synthesis with Probabilistic Constraints

We consider the problem of synthesizing a program given a probabilistic specification of its desired behavior. Specifically, we study the recent paradigm of distribution-guided inductive synthesis (DIGITS), which iteratively calls a synthesizer on finite sample sets from a given distribution. We make theoretical and algorithmic contributions: (i) We prove the surprising result that DIGITS only requires a polynomial number of synthesizer calls in the size of the sample set, despite its ostensibly exponential behavior. (ii) We present a property-directed version of DIGITS that further reduces the number of synthesizer calls, drastically improving synthesis performance on a range of benchmarks.

The majority of current work has focused on synthesis under Boolean constraints. However, we often require the program to adhere to a probabilistic specification, e.g., a controller that succeeds with high probability, a decision-making model operating over a probabilistic population model, a randomized algorithm ensuring privacy, etc. In this work, we are interested in (1) investigating probabilistic synthesis from a theoretical perspective and (2) developing efficient algorithmic techniques to tackle this problem.
Our starting point is our recent framework for probabilistic synthesis called distribution-guided inductive synthesis (digits) [1]. The digits framework is analogous in nature to the guess-and-check loop popularized by counterexample-guided approaches to synthesis and verification (cegis and cegar). The key idea of the algorithm is to reduce the probabilistic synthesis problem to a non-probabilistic one that can be solved using existing techniques, e.g., sat solvers. This is performed using the following loop: (1) approximating the input probability distribution with a finite sample set; (2) synthesizing a program for various possible output assignments of the finite sample set; and (3) invoking a probabilistic verifier to check if one of the synthesized programs indeed adheres to the given specification.
digits has been shown to converge, in theory, to correct programs when they exist, thanks to learning-theory guarantees. The primary bottleneck of digits is the number of expensive calls to the synthesizer, which is ostensibly exponential in the size of the sample set. Motivated by this observation, this paper makes theoretical, algorithmic, and practical contributions:
- On the theoretical side, we present a detailed analysis of digits and prove that it only requires a polynomial number of invocations of the synthesizer, which shows that the strong empirical performance of the algorithm is not merely due to the heuristics presented in [1] (Section 3).
- On the algorithmic side, we develop an improved version of digits that is property-directed, in that it only invokes the synthesizer on instances that have a chance of resulting in a correct program, without sacrificing convergence. We call the new approach τ-digits (Section 4).
- On the practical side, we implement τ-digits for sketch-based synthesis and demonstrate its ability to converge significantly faster than digits. We apply our technique to a range of benchmarks, including illustrative examples that elucidate our theoretical analysis, probabilistic repair problems of unfair programs, and probabilistic synthesis of controllers (Section 5).

An Overview of DIGITS
In this section, we present the synthesis problem, the digits [1] algorithm, and fundamental background on learning theory.

Probabilistic Synthesis Problem
Program Model. As discussed in [1], digits searches through some (infinite) set of programs, but it requires that the set of programs has finite VC dimension (we restate this condition in Section 2.3). Here we describe one constructive way of obtaining such sets of programs with finite VC dimension: we consider sets of programs defined as program sketches [26] in the simple grammar from [1], where a program is written in a loop-free language, and "holes" defining the sketch replace some constant terminals in expressions. The syntax of the language is defined below:
Here, P is a program, V is the set of variables appearing in P, E (resp. B) is the set of linear arithmetic (resp. Boolean) expressions over V (where, again, constants in E and B can be replaced with holes), and V ← E is an assignment. We assume a vector v_I of variables in V that are inputs to the program. We also assume there is a single Boolean variable v_r ∈ V that is returned by the program. All variables are real-valued or Boolean. Given a vector of constant values c, where |c| = |v_I|, we use P(c) to denote the result of executing P on the input c. In our setting, the inputs to a program are distributed according to some joint probability distribution D over the variables v_I. Semantically, a program P is denoted by a distribution transformer ⟦P⟧, whose input is a distribution over values of v_I and whose output is a distribution over v_I and v_r.
A program also has a probabilistic postcondition, post, defined as an inequality over terms of the form Pr[B], where B is a Boolean expression over v_I and v_r. Specifically, a probabilistic postcondition consists of Boolean combinations of the form e > c, where c ∈ R and e is an arithmetic expression over terms of the form Pr[B]. Given a triple (P, D, post), we say that P is correct with respect to D and post, denoted ⟦P⟧(D) |= post, iff post is true on the distribution ⟦P⟧(D).
We can write inclusion in the interval as a (C-style) program (left) and consider a postcondition stating that the interval must include at least half the input probability mass (right):
Synthesis Problem. digits outputs a program that is approximately "similar" to a given functional specification and that meets a postcondition. This functional specification is some input-output relation that we want to match as closely as possible: specifically, we want to minimize the error of the output program P from the functional specification P̂, defined as Er(P) := Pr_{x∼D}[P(x) ≠ P̂(x)]. (Note that we represent the functional specification as a program.) The postcondition is Boolean, and therefore we always want it to be true. digits is guaranteed to converge whenever the space of solutions satisfying the postcondition is robust under small perturbations. The following definition captures this notion of robustness:
Definition 1 (α-Robust Programs). Fix an input distribution D, a postcondition post, and a set of programs P. For any P ∈ P and any α > 0, denote the open α-ball centered at P as B_α(P) := {P′ ∈ P | Pr_{x∼D}[P(x) ≠ P′(x)] < α}. We say that P is α-robust if every P′ ∈ B_α(P) satisfies ⟦P′⟧(D) |= post.
We can now state the synthesis problem solved by digits:
Definition 2 (Synthesis Problem). Given an input distribution D, a set of programs P, a postcondition post, a functional specification P̂ ∈ P, and parameters α > 0 and 0 < ε ≤ α, the synthesis problem is to find a program P ∈ P such that ⟦P⟧(D) |= post and such that any other α-robust P′ has Er(P) ≤ Er(P′) + ε.

A Naive DIGITS Algorithm
Algorithm 1 shows a simplified, naive version of digits, which employs a synthesize-then-verify approach. The idea of digits is to utilize non-probabilistic synthesis techniques to synthesize a set of programs, and then apply a probabilistic verification step to check if any of the synthesized programs is a solution.

Algorithm 1: Naive digits
Specifically, this "Naive digits" begins by sampling an appropriate number of inputs from the input distribution and stores them in the set S. Second, it iteratively explores each possible function f that maps the input samples to a Boolean and invokes a synthesis oracle to synthesize a program P that implements f , i.e. that satisfies the set of input-output examples in which each input x ∈ S is mapped to the output f (x). Naive digits then finds which of the synthesized programs satisfy the postcondition (the set res); we assume that we have access to a probabilistic verifier O ver to perform these computations. Finally, the algorithm outputs the program in the set res that has the lowest error with respect to the functional specification, once again assuming access to another oracle O err that can measure the error.
Note that the number of such functions f : S → {0, 1} is exponential in |S|. As a "heuristic" to improve performance, the actual digits algorithm as presented in [1] employs an incremental trie-based search, which we describe (alongside our new algorithm, τ-digits) and analyze in Section 3. The naive version described here is, however, sufficient to discuss the convergence properties of the full algorithm.
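The synthesize-then-verify loop of Naive digits can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the interval programs [0, a] used as a running example, a uniform input distribution over [0, 1], the functional specification [0, 0.3], and the postcondition "the interval contains at least half the probability mass". Under these assumptions the oracles `o_syn`, `o_ver`, and `o_err` (hypothetical names) can be computed exactly.

```python
import itertools
import random

SPEC_A = 0.3  # assumed functional specification: the interval [0, 0.3]


def o_syn(samples, labels):
    """Return a with [0, a] consistent with every labeled sample, or None."""
    ones = [x for x, b in zip(samples, labels) if b == 1]
    zeros = [x for x, b in zip(samples, labels) if b == 0]
    lo = max(ones, default=0.0)   # a must be at least every 1-labeled point
    hi = min(zeros, default=1.0)  # a must be below every 0-labeled point
    if zeros and lo >= hi:
        return None               # no interval [0, a] realizes this labeling
    return (lo + hi) / 2


def o_ver(a):
    """Postcondition under uniform inputs: Pr[x in [0, a]] = a >= 1/2."""
    return a >= 0.5


def o_err(a):
    """Error w.r.t. the specification: mass where [0, a] and [0, 0.3] differ."""
    return abs(a - SPEC_A)


def naive_digits(m, seed=0):
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(m)]
    candidates = []
    # Exponential enumeration of every labeling f : S -> {0, 1}.
    for labels in itertools.product([0, 1], repeat=m):
        a = o_syn(samples, labels)
        if a is not None:
            candidates.append(a)
    res = [a for a in candidates if o_ver(a)]
    return min(res, key=o_err) if res else None
```

Calling `naive_digits(6)` enumerates all 64 labelings, although only a handful of them are realizable by an interval; this gap is what the trie-based search exploits.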

Convergence Guarantees
digits is only guaranteed to converge when the program model P has finite VC dimension. Intuitively, the VC dimension captures the expressiveness of the set of ({0, 1}-valued) programs P. Given a set of inputs S, we say that P shatters S iff, for every partition of S into sets S_0 ⊔ S_1, there exists a program P ∈ P such that (i) for every x ∈ S_0, P(x) = 0, and (ii) for every x ∈ S_1, P(x) = 1.

Definition 3 (VC Dimension).
The VC dimension of a set of programs P is the largest integer d such that there exists a set of inputs S with cardinality d that is shattered by P.
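As a small illustration of shattering, the following sketch checks by brute force whether a finite family of programs shatters a given set of points. The family used here, intervals [0, a] with a on a grid of 101 values, is an assumption made for the example; it is a finite stand-in for the infinite set of such intervals.

```python
def interval_program(a):
    """The program that returns 1 iff x lies in the interval [0, a]."""
    return lambda x: int(0.0 <= x <= a)


def shatters(programs, points):
    """True iff every 0/1 labeling of `points` is realized by some program."""
    realized = {tuple(p(x) for x in points) for p in programs}
    return len(realized) == 2 ** len(points)


# Assumed finite grid of candidate right endpoints.
PROGRAMS = [interval_program(i / 100) for i in range(101)]

one_point = shatters(PROGRAMS, [0.5])        # a single point can be shattered
two_points = shatters(PROGRAMS, [0.3, 0.7])  # 0.3 -> 0 and 0.7 -> 1 is impossible
```

Here `one_point` is true and `two_points` is false, which agrees with these interval programs having VC dimension 1.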
To reiterate, suppose P* is a correct program with small error Er(P*) = k; the convergence result follows from two main points: (i) P* must be α-robust, meaning every P with Pr_{x∼D}[P(x) ≠ P*(x)] < α must also be correct, and therefore (ii) any synthesized P such that Pr_{x∼D}[P(x) ≠ P*(x)] ≤ ε, where ε < α, is a correct program with error Er(P) within k ± ε.

Understanding Convergence
The importance of finite VC dimension is due to the fact that the convergence statement borrows directly from probably approximately correct (PAC) learning. We will briefly discuss a core detail of efficient PAC learning that is relevant to understanding the convergence of digits (and, in turn, our analysis of τ-digits in Section 4), and refer the interested reader to Kearns and Vazirani's book [15] for a complete overview. Specifically, we consider the notion of an ε-net, which establishes the approximate-definability of a target program in terms of points in its input space.
Definition 4 (ε-net). Suppose P ∈ P is a target program, and points in its input domain X are distributed x ∼ D. For a fixed ε ∈ [0, 1], we say a set of points S ⊂ X is an ε-net for P (with respect to P and D) if for every P′ ∈ P with Pr_{x∼D}[P(x) ≠ P′(x)] > ε there exists a witness x ∈ S such that P(x) ≠ P′(x).
In other words, if S is an ε-net for P, and if P′ "agrees" with P on all of S, then P and P′ can only differ by at most ε probability mass.
Observe the relevance of ε-nets to the convergence of digits: the synthesis oracle is guaranteed not to "fail" by producing only programs ε-far from some ε-robust P* if the sample set happens to be an ε-net for P*. In fact, this observation is the core of the PAC learning argument: having an ε-net is exactly what guarantees approximate learnability.
A remarkable result of computational learning theory is that whenever P has finite VC dimension, the probability that m random samples fail to yield an ε-net becomes diminishingly small as m increases. Indeed, the given VCcost function used in Theorem 1 is a dual form of this latter result: polynomially many samples are sufficient to form an ε-net with high probability.

The Trie-Based Search Strategy of DIGITS
Naive digits, as presented in Algorithm 1, performs a very unstructured, exponential search over the output labelings of the sampled inputs, i.e., the possible Boolean functions f in Algorithm 1. In our original paper [1] we present a "heuristic" implementation strategy that incrementally explores the set of possible output labelings using a trie data structure. In this section, we study the complexity of this technique through the lens of computational learning theory and discover the surprising result that digits requires a polynomial number of calls to the synthesizer in the size of the sample set! Our improved search algorithm (Section 4) inherits these results.
For the remainder of this paper, we use digits to refer to this incremental version. A full description is necessary for our analysis: Figure 1 (non-framed rules only) consists of a collection of guarded rules describing the construction of the trie used by digits to incrementally explore the set of possible output labelings. Our improved version, τ-digits (presented in Section 4), corresponds to the addition of the framed parts, but without them, the rules describe digits.
Nodes in the trie represent partial output labelings, i.e., functions f assigning Boolean values to only some of the samples in S = {x_1, ..., x_m}. Each node is identified by a binary string σ = b_1 ··· b_k (k can be smaller than m) denoting the path to the node from the root. The string σ also describes the partial output-labeling function f corresponding to the node: if the i-th bit b_i is set to 1, then f(x_i) = true. The set explored represents the nodes in the trie built thus far; for each new node, the algorithm synthesizes a program consistent with the corresponding partial output function ("Explore" rules). The variable depth controls the incremental aspect of the search and represents the maximum length of any σ in explored; it is incremented whenever all nodes up to that depth have been explored (the "Deepen" rule). The crucial part of the algorithm is that, if no program can be synthesized for the partial output function of a node identified by σ, the algorithm does not need to issue further synthesis queries for the descendants of σ.
[Figure 2 legend: hollow circles denote calls to O_syn that yield new programs; the cross denotes a call to O_syn that returns ⊥.]
Figure 2 shows how digits builds a trie for an example run on the interval programs from Example 1, where we suppose we begin with an incorrect program describing the interval [0, 0.3]. Initially, we set the root program to [0, 0.3] (left figure). The "Deepen" rule applies, so a sample is added to the set of samples; suppose it is 0.4. "Explore" rules are then applied twice to build the children of the root: the child following the 0 branch needs to map 0.4 → 0, which [0, 0.3] already does, so the program is propagated to that child without asking O_syn to perform a synthesis query. For the child following 1, we instead make a synthesis query, using the oracle O_syn, for any value of a such that [0, a] maps 0.4 → 1; suppose it returns the solution a = 1, and we associate [0, 1] with this node. At this point we have exhausted depth 1 (middle figure), so "Deepen" once again applies, perhaps adding 0.6 to the sample set. At this depth (right figure), only two calls to O_syn are made: in the case of the call at σ = 01, there is no value of a that causes both 0.4 → 0 and 0.6 → 1, so O_syn returns ⊥, and we do not try to explore any children of this node in the future. The algorithm continues in this manner until a stopping condition is reached, e.g., enough samples are enumerated.
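The incremental exploration described above can be sketched as follows. This is an illustrative reading of the "Explore" and "Deepen" steps for the interval programs [0, a], not the rules of Figure 1 themselves; `o_syn`, `trie_digits`, and the midpoint choice made by the oracle are assumptions of this sketch.

```python
def o_syn(samples, labels):
    """Return a with [0, a] consistent with every labeled sample, or None."""
    ones = [x for x, b in zip(samples, labels) if b == 1]
    zeros = [x for x, b in zip(samples, labels) if b == 0]
    lo = max(ones, default=0.0)
    hi = min(zeros, default=1.0)
    if zeros and lo >= hi:
        return None
    return (lo + hi) / 2


def trie_digits(samples, root_a=0.3):
    """Build the trie depth by depth; return (explored, number of O_syn calls)."""
    explored = {(): root_a}      # sigma (tuple of bits) -> program, or None
    frontier = [()]              # realizable nodes at the current depth
    queries = 0
    for depth, x in enumerate(samples, start=1):   # "Deepen": add one sample
        next_frontier = []
        for sigma in frontier:
            a = explored[sigma]
            parent_bit = int(x <= a)               # output of the parent's program
            for bit in (0, 1):                     # "Explore" both children
                child = sigma + (bit,)
                if bit == parent_bit:
                    explored[child] = a            # propagate, no oracle call
                else:
                    queries += 1
                    explored[child] = o_syn(samples[:depth], child)
                if explored[child] is not None:
                    next_frontier.append(child)    # unrealizable nodes are pruned
        frontier = next_frontier
    return explored, queries
```

On the samples 0.4 and 0.6 from the example run, this sketch issues three oracle calls in total, one of which (at σ = 01) fails; the programs it returns differ from those in the example only because the assumed oracle picks midpoints.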

Polynomial Bound on the Number of Synthesis Queries
We observed in [1] that the trie-based exploration seems to be efficient in practice, despite potential exponential growth of the number of explored nodes in the trie as the depth of the search increases. The convergence analysis of digits relies on the finite VC dimension of the program model, but VC dimension itself is just a summary of the growth function, a function that describes a notion of complexity of the set of programs in question. We will see that the growth function much more precisely describes the behavior of the trie-based search; we will then use a classic result from computational learning theory to derive better bounds on the performance of the search. We define the growth function below, adapting the presentation from [15].
Definition 5 (Realizable Dichotomies). We are given a set P of programs representing functions from X → {0, 1} and a (finite) set of inputs S ⊂ X. We call any f : S → {0, 1} a dichotomy of S; if there exists a program P ∈ P that extends f to its full domain X, we call f a realizable dichotomy in P. We denote the set of realizable dichotomies as Π_P(S) := {f : S → {0, 1} | ∃P ∈ P. ∀x ∈ S. P(x) = f(x)}.
Observe that for any (infinite) set P and any finite set S, 1 ≤ |Π_P(S)| ≤ 2^{|S|}. We define the growth function in terms of the realizable dichotomies:
Definition 6 (Growth Function). The growth function is the maximal number of realizable dichotomies as a function of the number of samples, denoted Π̂_P(m) := max_{S ⊂ X : |S| = m} |Π_P(S)|.
Observe that P has VC dimension d if and only if d is the largest integer satisfying Π̂_P(d) = 2^d (and infinite VC dimension when Π̂_P(m) is identically 2^m); in fact, VC dimension is often defined using this characterization. When digits terminates having used a sample set S, it has considered all the dichotomies of S: the programs it has enumerated exactly correspond to extensions of the realizable dichotomies Π_P(S). The trie-based exploration is effectively trying to minimize the number of O_syn queries performed on non-realizable ones, but doing so without explicit knowledge of the full functional behavior of programs in P. In fact, it manages to stay relatively close to performing queries only on the realizable dichotomies:
Proof. Let T_d denote the total number of queries performed once depth d is completed. We perform no queries for the root, thus T_0 = 0. Upon completing depth d − 1, the realizable dichotomies of {x_1, ..., x_{d−1}} exactly specify the nodes whose children will be explored at depth d. For each such node, one child is skipped due to solution propagation, while an oracle query is performed on the other, thus T_d = T_{d−1} + |Π_P({x_1, ..., x_{d−1}})|.
Connecting digits to the realizable dichotomies and, in turn, the growth function allows us to employ a remarkable result from computational learning theory, stating that the growth function for any set exhibits one of two asymptotic behaviors: it is either identically 2^m (infinite VC dimension) or dominated by a polynomial! This is commonly called the Sauer-Shelah Lemma [23,25]:
Combining our lemma with this famous one yields a surprising result: for a fixed set of programs P with finite VC dimension, the number of oracle queries performed by digits is guaranteed to be polynomial in the depth of the search, where the degree of the polynomial is determined by the VC dimension:
In short, the reason an execution of digits seems to enumerate a sub-exponential number of programs (as a function of the depth of the search) is that it literally must be polynomial. Furthermore, the algorithm performs oracle queries on nearly only those polynomially many realizable dichotomies.
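The counting argument can be spelled out as a short derivation. This is a sketch under assumptions not stated verbatim in the text above: T_m denotes the number of oracle queries after completing depth m (one query per realizable dichotomy of the previous depth, as in the proof), the input domain is infinite so that the growth function is non-decreasing, and the Sauer-Shelah Lemma is taken in its common textbook form, \hat{\Pi}_{\mathcal{P}}(m) \le (em/d)^d for m \ge d \ge 1, where d is the VC dimension.

```latex
\begin{align*}
T_m &= \sum_{\ell=1}^{m} \bigl|\Pi_{\mathcal{P}}(\{x_1,\dots,x_{\ell-1}\})\bigr|
     && \text{(unrolling the per-depth query count)} \\
    &\le \sum_{\ell=1}^{m} \hat{\Pi}_{\mathcal{P}}(\ell-1)
     && \text{(definition of the growth function)} \\
    &\le m \cdot \hat{\Pi}_{\mathcal{P}}(m)
     && \text{($\hat{\Pi}_{\mathcal{P}}$ is non-decreasing)} \\
    &\le m \left(\frac{e\,m}{d}\right)^{d} = O\!\left(m^{d+1}\right)
     && \text{(Sauer-Shelah, assumed form, } m \ge d \ge 1\text{)}
\end{align*}
```

For programs of VC dimension d = 1, such as the intervals [0, a], the bound reads O(m^2).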

Example 3.
A digits run on the [0, a] programs as in Figure 2 using a sample set of size m will perform O(m^2) oracle queries, since the VC dimension of these intervals is 1. (In fact, every run of the algorithm on these programs will perform exactly m(m + 1)/2 queries.)

Property-Directed τ-DIGITS
digits has better convergence guarantees when it operates on larger sets of sampled inputs. In this section, we describe a new optimization of digits that reduces the number of synthesis queries performed by the algorithm so that it more quickly reaches higher depths in the trie, and thus allows it to scale to larger sample sets. This optimized digits, called τ-digits, is shown in Figure 1 as the set of all the rules of digits plus the framed elements. The high-level idea is to skip synthesis queries that are (quantifiably) unlikely to result in optimal solutions. For example, if the functional specification P̂ maps every sampled input in S to 0, then the synthesis query on the mapping of every element of S to 1 becomes increasingly likely to result in programs that have maximal distance from P̂ as the size of S increases; hence the algorithm could probably avoid performing that query. In the following, we make use of the concept of Hamming distance between pairs of programs:
Definition 7 (Hamming Distance). For any finite set of inputs S and any two programs P_1, P_2, we denote Hamming_S(P_1, P_2) := |{x ∈ S | P_1(x) ≠ P_2(x)}| (we will also allow any {0, 1}-valued string to be an argument of Hamming_S).

Algorithm Description
Fix the given functional specification P̂ and suppose that there exists an ε-robust solution P* with (nearly) minimal error k = Er(P*) := Pr_{x∼D}[P̂(x) ≠ P*(x)]; we would be happy to find any program P in P*'s ε-ball. Suppose we angelically know k a priori, and we thus restrict our search (for each depth m) only to constraint strings (i.e. σ in Figure 1) that have Hamming distance not much larger than km.
To be specific, we first fix some threshold τ ∈ (k, 1]. Intuitively, the optimization corresponds to modifying digits to consider only paths σ through the trie such that Hamming_S(P̂, σ) ≤ τ|S|. This is performed using the unblocked function in Figure 1. Since we are ignoring certain paths through the trie, we need to ask: how much does this decrease the probability of the algorithm succeeding? It depends on the tightness of the threshold, which we address in Section 4.2. In Section 4.3, we discuss how to adaptively modify the threshold τ as τ-digits is executing, which is useful when a good τ is unknown a priori.
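The thresholding check can be sketched as below. The exact definition of unblocked lives in Figure 1 and is not reproduced here, so this is only one plausible reading: a (possibly partial) labeling σ stays unblocked as long as its Hamming distance from the functional specification, measured on the samples it labels, is at most τ|S|. The names `hamming` and `unblocked` and the example specification are assumptions of this sketch.

```python
def hamming(spec, sigma, samples):
    """Number of labeled samples on which sigma disagrees with spec.

    sigma may be shorter than samples; zip stops at the labeled prefix.
    """
    return sum(1 for x, b in zip(samples, sigma) if spec(x) != b)


def unblocked(spec, sigma, samples, tau):
    """Assumed reading of the pruning rule: Hamming_S(spec, sigma) <= tau * |S|."""
    return hamming(spec, sigma, samples) <= tau * len(samples)


# Assumed functional specification: the interval [0, 0.3].
def spec(x):
    return int(x <= 0.3)


samples = [0.1, 0.2, 0.5, 0.9]
far = unblocked(spec, (0, 0, 1, 1), samples, tau=0.5)   # disagrees on all 4 samples
near = unblocked(spec, (1, 1, 1, 0), samples, tau=0.5)  # disagrees on 1 sample
```

With τ = 0.5 the labeling that flips every output of the specification is pruned, while the one that flips a single output is still explored.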

Analyzing Failure Probability with Thresholding
Using τ-digits, the choice of τ will affect both (i) how many synthesis queries are performed, and (ii) the likelihood that we miss optimal solutions; in this section we explore the latter point. Interestingly, we will see that all of the analysis is dependent only on parameters directly related to the threshold; notably, none of this analysis is dependent on the complexity of P (i.e. its VC dimension).
If we really want to learn (something close to) a program P*, then we should use a value of the threshold τ such that Pr_{S∼D^m}[Hamming_S(P̂, P*) ≤ τm] is large; to do so requires knowledge of the distribution of Hamming_S(P̂, P*). Recall the binomial distribution: for parameters (n, p), it describes the number of successes in n-many trials of an experiment that has success probability p.
Claim. Fix P and let k = Pr_{x∼D}[P̂(x) ≠ P(x)]. If S is sampled from D^m, then Hamming_S(P̂, P) is binomially distributed with parameters (m, k).
Next, we will use our knowledge of this distribution to reason about the failure probability, i.e. the probability that τ-digits does not preserve the convergence result of digits.
The simplest argument we can make is a union-bound style argument: the thresholded algorithm can "fail" by (i) failing to sample an ε-net, or otherwise (ii) sampling a set on which the optimal solution has a Hamming distance that is not representative of its actual distance. We provide the quantification of this failure probability in the following theorem:
Theorem 3. Let P* be a target ε-robust program with k = Pr_{x∼D}[P̂(x) ≠ P*(x)], and let δ be the probability that m samples do not form an ε-net for P*. If we run τ-digits with τ ∈ (k, 1], then the failure probability is at most δ + Pr[X > τm], where X ∼ Binomial(m, k).
In other words, we can use tail probabilities of the binomial distribution to bound the probability that the threshold causes us to "miss" a desirable program we otherwise would have enumerated. Explicitly, we have the following corollary:
Informally, when m is not too small, k is not too large, and τ is reasonably forgiving, these tail probabilities can be quite small. We can even analyze the asymptotic behavior by using any existing upper bounds on the binomial distribution's tail probabilities; importantly, the additional error diminishes exponentially as m increases, dependent on the size of τ relative to k. As stated at the beginning of this subsection, the balancing act is to choose τ (i) small enough so that the algorithm is still fast for large m, yet (ii) large enough so that the algorithm is still likely to learn the desired programs. The further challenge is to relax our initial strong assumption that we know the optimal k a priori when determining τ, which we address in the following subsection.
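To get a feel for the magnitude of the tail term Pr[X > τm], the following sketch computes it exactly and compares it with the one-sided Hoeffding bound e^{-2m(τ-k)^2}, which is one of the "existing upper bounds" alluded to above. The function names and the parameter values in the usage lines are illustrative assumptions, not figures from the paper.

```python
from math import comb, exp, floor


def binomial_upper_tail(m, k, tau):
    """Exact Pr[X > tau * m] for X ~ Binomial(m, k)."""
    # X is an integer, so X > tau*m means X >= floor(tau*m) + 1.
    # The small offset guards against tau*m landing just below an integer
    # because of floating-point rounding.
    threshold = floor(tau * m + 1e-9)
    return sum(comb(m, i) * k**i * (1 - k) ** (m - i)
               for i in range(threshold + 1, m + 1))


def hoeffding_tail_bound(m, k, tau):
    """Hoeffding upper bound on Pr[X > tau * m]; requires tau > k."""
    return exp(-2 * m * (tau - k) ** 2)


# Illustrative values: 100 samples, true distance k = 0.05, threshold 0.15.
exact = binomial_upper_tail(100, 0.05, 0.15)
bound = hoeffding_tail_bound(100, 0.05, 0.15)
```

The exact tail is never larger than the Hoeffding bound, and both shrink exponentially in m for a fixed gap τ − k.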

Adaptive Threshold
Of course, we do not have the angelic knowledge that lets us pick an ideal threshold τ; the only absolutely sound choice we can make is the trivial τ = 1. Fortunately, we can begin with this choice of τ and adaptively refine it as the search progresses. Specifically, every time we encounter a correct program P such that k = Er(P), we can refine τ to reflect our newfound knowledge that "the best solution has distance of at most k." We refer to this refinement as adaptive τ-digits. The modification involves the addition of the following rule to Figure 1:
We can use any (non-decreasing) function g to update the threshold τ ← g(k). The simplest choice would be the identity function (which we use in our experiments), although one could use a looser function so as not to over-prune the search. If we choose functions of the form g(k) = k + b, then Corollary 2 allows us to make (slightly weak) claims of the following form:
Claim. Suppose the adaptive algorithm completes a search of up to depth m yielding a best solution with error k (so we have the final threshold value τ = k + b). Suppose also that P* is an optimal ε-robust program at distance k − η. The optimization-added failure probability (as in Corollary 1) for a run of (non-adaptive) τ-digits completing depth m and using this τ is at most e^{−2m(b+η)^2}.
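The exponent in this claim can be recovered with a short calculation. The sketch below assumes that X = Hamming_S(P̂, P*) is distributed as Binomial(m, k − η), in line with the earlier claim about Hamming distances, and it uses the one-sided Hoeffding inequality Pr[X − E[X] ≥ tm] ≤ e^{−2mt^2}; the symbol t is introduced here only for illustration.

```latex
\begin{align*}
\Pr[X > \tau m]
  &= \Pr\bigl[X - (k-\eta)\,m > (\tau - k + \eta)\,m\bigr]
     && \text{subtract } \mathbb{E}[X] = (k-\eta)\,m \\
  &= \Pr\bigl[X - \mathbb{E}[X] > (b+\eta)\,m\bigr]
     && \text{since } \tau = k + b \\
  &\le e^{-2m(b+\eta)^2}
     && \text{Hoeffding with } t = b + \eta > 0
\end{align*}
```

With the identity refinement (b = 0), the bound depends only on η, the gap between the best error found so far and the optimum.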

Evaluation
Implementation. In this section, we evaluate our new algorithm τ-digits (Figure 1) and its adaptive variant (Section 4.3) against digits (i.e., τ-digits with τ = 1). Both algorithms are implemented in Python and use the SMT solver Z3 [8] to implement a sketch-based synthesizer O_syn. We employ statistical verification for O_ver and O_err: we use Hoeffding's inequality for estimating probabilities in post and Er. Probabilities are computed with 95% confidence, leaving our oracles potentially unsound.
Research Questions. Our evaluation aims to answer the following questions:
RQ1 Is adaptive τ-digits more effective/precise than τ-digits?
RQ2 Is τ-digits more effective/precise than digits?
RQ3 Can τ-digits solve challenging synthesis problems?
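A Hoeffding-based statistical oracle can be sketched as follows. This is not the authors' code: the function names, the accuracy parameter `eps`, and the example event are assumptions; only the 95% confidence level (`delta = 0.05`) comes from the text. The two-sided Hoeffding inequality gives that n ≥ ln(2/δ) / (2ε²) samples suffice for the empirical frequency to be within ε of the true probability with probability at least 1 − δ.

```python
import random
from math import ceil, log


def hoeffding_samples(eps, delta):
    """Samples needed so that |estimate - truth| <= eps with prob. >= 1 - delta."""
    return ceil(log(2 / delta) / (2 * eps**2))


def estimate_probability(event, sampler, eps, delta, rng):
    """Monte Carlo estimate of Pr[event(x)] for x drawn by sampler(rng)."""
    n = hoeffding_samples(eps, delta)
    hits = sum(1 for _ in range(n) if event(sampler(rng)))
    return hits / n


# Illustrative use: probability that a uniform input falls in [0, 0.5].
rng = random.Random(0)
estimate = estimate_probability(
    event=lambda x: x <= 0.5,
    sampler=lambda r: r.random(),
    eps=0.05,
    delta=0.05,
    rng=rng,
)
```

Because the estimate is only correct with probability 1 − δ, a verifier built this way can occasionally accept or reject the wrong program, which is the sense in which such oracles are potentially unsound.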
We experiment on three sets of benchmarks: (i) synthetic examples for which the optimal solutions can be computed analytically (Section 5.1), (ii) the set of benchmarks considered in the original digits paper (Section 5.2), (iii) a variant of the thermostat-controller synthesis problem presented in [7] (Section 5.3).

Synthetic Benchmarks
We consider a class of synthetic programs for which we can compute the optimal solution exactly; this lets us compare the results of our implementation to an ideal baseline. Here, the program model P is defined as the set of axis-aligned hyperrectangles within [−1, 1]^d (d ∈ {1, 2, 3} and the VC dimension is 2d), and the input distribution D is such that inputs are distributed uniformly over [−1, 1]^d. We fix some probability mass b ∈ {0.05, 0.1, 0.2} and define the benchmarks so that the best error for a correct solution is exactly b (see Appendix B).
We run our implementation using thresholds τ ∈ {0.07, 0.15, 0.3, 0.5, 1}, omitting those values for which τ < b; additionally, we consider an adaptive run where τ is initialized to 1, and whenever a new best solution is enumerated with error k, we update τ ← k. Each combination of parameters was run for a period of 2 minutes. By studying Figure 3 we see that the adaptive threshold search performs at least as well as the tight thresholds fixed a priori because reasonable solutions are found early. In fact, all search configurations find solutions very close to the optimal error (indicated by the horizontal dashed line). Regardless, they reach different depths, and the main advantage of reaching large depths concerns the strength of the optimality guarantee. Note, also, that small τ values are necessary to see improvements in the completed depth of the search. Indeed, the discrepancy between the depth-versus-time functions diminishes drastically for the problem instances with larger values of b (see Appendix B); the gains of the optimization are contingent on the existence of correct solutions close to the functional specification.
Findings (RQ1): τ-digits does tend to find reasonable solutions at early depths and near-optimal solutions at later depths, thus adaptive τ-digits is more effective than τ-digits, and we use it throughout our remaining experiments.

Original DIGITS Benchmarks
The original digits paper [1] evaluates on a set of 18 repair problems of varying complexity. The functional specifications are machine-learned decision trees and support vector machines, and each search space P involves the set of programs formed by replacing some number of real-valued constants in the program with holes. The postcondition is a form of algorithmic fairness, e.g., the program should output true on inputs of type A as often as it does on inputs of type B [10]. For each such repair problem, we run both digits and adaptive τ-digits (again, with initial τ = 1 and the identity refinement function). Each benchmark is run for 10 minutes, where the same sample set is used for both algorithms. Figure 4 shows, for each benchmark, (i) the largest sample set size completed by adaptive τ-digits versus digits (left; a point above the diagonal line indicates that adaptive τ-digits reaches further depths), and (ii) the error of the best solution found by adaptive τ-digits versus digits (right; a point below the diagonal line indicates that adaptive τ-digits finds better solutions). We see that adaptive τ-digits reaches further depths on every problem instance, many of which are substantial improvements, and that it finds better solutions on 10 of the 18 problems. For those which did not improve, either the search was already deep enough that digits was able to find near-optimal solutions, or the complexity of the synthesis queries is such that the search is still constrained to small depths.
Findings (RQ2): Adaptive τ-digits can find better solutions than those found by digits and can reach greater search depths.

Thermostat Controller
We challenge adaptive τ -digits with the task of synthesizing a thermostat controller, borrowing the benchmark from [7]. The input to the controller is the initial temperature of the environment; since the world is uncertain, there is a specified probability distribution over the temperatures. The controller itself is a program sketch consisting primarily of a single main loop: iterations of the loop correspond to timesteps, during which the synthesized parameters dictate an incremental update made by the thermostat based on the current temperature. The loop runs for 40 iterations, then terminates, returning the absolute value of the difference between its final actual temperature and the target temperature.
The postcondition is a Boolean probabilistic correctness property intuitively corresponding to controller safety, e.g. with high probability, the temperature should never exceed certain thresholds. In [7], there is a quantitative objective in the form of minimizing the expected value E[|actual − target |]-our setting does not admit optimizing with respect to expectations, so we must modify the problem. Instead, we fix some value N (N ∈ {2, 4, 8}) and have the program return 0 when |actual − target| < N and 1 otherwise. Our quantitative objective is to minimize the error from the constant-zero functional specificationP (x) := 0 (i.e. the actual temperature always gets close enough to the target). The full specification of the controller is provided in Appendix C. We consider variants of the program where the thermostat runs for fewer timesteps and try loop unrollings of size {5, 10, 20, 40}. We run each benchmark for 10 minutes: the final completed search depths and best error of solutions are shown in Figure 5. For this particular experiment, we use the SMT solver CVC4 [3] because it performs better than Z3 on the occurring SMT instances.
As we would expect, for larger values of N it is "easier" for the thermostat to reach the target temperature threshold, and thus the quality of the best solution increases with N. However, with small unrollings (i.e., 5) the synthesized controllers do not have enough iterations (time) to modify the temperature enough for the probability mass of extremal temperatures to reach the target: as we increase the number of unrollings to 10, we see that better solutions can be found since the set of programs is capable of stronger behavior.
On the other hand, the completed depth of the search plummets as the unrolling increases due to the complexity of the O_syn queries. Consequently, for 20 and 40 unrollings, adaptive τ-digits synthesizes worse solutions because it cannot reach the necessary depths to obtain better guarantees.
One final point of note is that for N = 8 and 10 unrollings, there appears to be a sharp spike in the completed depth. However, this is somewhat artificial: because N = 8 creates a very lenient quantitative objective, an early O_syn query happens to yield a program with an error less than 10^{-3}. Adaptive τ-digits then updates τ to approximately 10^{-3} and skips most synthesis queries.
Findings (RQ3): Adaptive τ-digits can synthesize small variants of a complex thermostat controller, but cannot solve variants with many loop iterations.
Synthesis & Probability. Program synthesis is a mature area with many powerful techniques. The primary focus is on synthesis under Boolean constraints, and probabilistic specifications have received less attention [1, 7, 18, 16]. We discuss the works that are most related to ours.
digits [1] is the most relevant work. First, we show for the first time that digits only requires a number of synthesis queries polynomial in the number of samples. Second, our adaptive τ-digits further reduces the number of synthesis queries required to solve a synthesis problem without sacrificing correctness.
The technique of smoothed proof search [7] approximates a combination of functional correctness and maximization of an expected value as a smooth, continuous function. It then uses numerical methods to find a local optimum of this function, which translates to a synthesized program that is likely to be correct and locally maximal. The benchmarks described in Section 5.3 are variants of benchmarks from [7]. Smoothed proof search can minimize expectation; τ-digits minimizes probability only. However, unlike τ-digits, smoothed proof search lacks formal convergence guarantees and cannot support the rich probabilistic postconditions we support, e.g., as in the fairness benchmarks.
Works on synthesis of probabilistic programs are aimed at a different problem [18, 6, 22]: that of synthesizing a generative model of data. For example, Nori et al. [18] use sketches of probabilistic programs and complete them with a stochastic search. Recently, Saad et al. [22] synthesize an ensemble of probabilistic programs for learning Gaussian processes and other models.
Kučera et al. [16] present a technique for automatically synthesizing program transformations that introduce uncertainty into a given program with the goal of satisfying given privacy policies, e.g., preventing information leaks. They leverage the specific structure of their problem to reduce it to an SMT constraint solving problem. The problem tackled in [16] is orthogonal to the one targeted in this paper and the techniques are therefore very different.
Stochastic Satisfiability. Our problem is closely related to e-majsat [17], a special case of stochastic satisfiability (ssat) [19] and a means for formalizing probabilistic planning problems. e-majsat is of NP^PP complexity. An e-majsat formula has deterministic and probabilistic variables. The goal is to find an assignment of deterministic variables such that the probability that the formula is satisfied is above a given threshold. Our setting is similar, but we operate over complex program statements and have an additional optimization objective (i.e., the program should be close to the functional specification). The deterministic variables in our setting are the holes defining the search space; the probabilistic variables are program inputs.
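To illustrate the shape of an e-majsat instance, the following brute-force sketch enumerates assignments to the deterministic variables and, for each, computes the fraction of assignments to the probabilistic variables that satisfy the formula. The function names and the example formula are hypothetical, the probabilistic variables are assumed to be independent fair coin flips, and exhaustive enumeration is used only for clarity; it is exponential and is not how any solver cited here works.

```python
from itertools import product


def best_assignment(formula, n_det, n_prob):
    """Return (assignment, probability) maximizing the satisfaction probability.

    `formula(det, rnd)` takes a tuple of deterministic values and a tuple of
    probabilistic values (each assumed uniform over {False, True}).
    """
    best = None
    for det in product([False, True], repeat=n_det):
        satisfied = sum(
            formula(det, rnd) for rnd in product([False, True], repeat=n_prob)
        )
        prob = satisfied / 2 ** n_prob
        if best is None or prob > best[1]:
            best = (det, prob)
    return best


def example_formula(det, rnd):
    """Hypothetical formula: (x or r1 or r2) and (not x or r1)."""
    (x,) = det
    r1, r2 = rnd
    return (x or r1 or r2) and ((not x) or r1)


assignment, probability = best_assignment(example_formula, n_det=1, n_prob=2)
meets_threshold = probability > 0.5  # threshold chosen arbitrarily for illustration
```

In the analogy drawn above, `det` plays the role of the holes in the sketch and `rnd` plays the role of the program inputs.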

A.1 Main Theorem
Proof (Theorem 2). Let S be the set of samples, with |S| = m. By Lemma 1, the number of queries is at most |S| · |Π_P(S)|, which is in turn at most m · Π_P(m). Applying Lemma 2 immediately gives us the O(m^{d+1}) bound.
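Written as a single chain, the argument reads as follows. This is a sketch under an assumption: Lemma 2 is not restated in this appendix, and we take it to be a Sauer-type bound on the growth function, Π_P(m) = O(m^d), where d is the VC dimension of the program class, as the final bound suggests.

```latex
\[
\#\text{queries}
\;\le\; |S| \cdot |\Pi_{\mathcal{P}}(S)|   % Lemma 1
\;\le\; m \cdot \Pi_{\mathcal{P}}(m)       % growth function maximizes over all sample sets of size m
\;=\; m \cdot O(m^{d})                     % Lemma 2 (assumed form)
\;=\; O(m^{d+1}).
\]
```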

A.2 Interval Details
Here we expand on the details related to the set of interval programs (Figure 2) that were elided in the various examples in Section 3. Technically, this claim is not correct: when 0 ∈ S, the number of dichotomies is one fewer. However, S is obtained by sampling from a distribution, and if the distribution over [0, 1] does not contain atoms, then this case almost surely does not happen. We omit this detail for simpler presentation throughout.
Proof. Let the elements of S = {x_1, ..., x_m} be ordered increasingly; there are exactly |S| + 1 equivalence classes of programs based on the choice of a: one from a < x_1, one from a > x_m, and (|S| − 1)-many from x_i < a < x_{i+1} for i ∈ {1, ..., m − 1}.
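The count can be checked mechanically. The sketch below assumes, hypothetically, that an interval program with parameter a ∈ [0, 1] returns true exactly when x ≤ a; Figure 2 is not reproduced here, so this form is an assumption made for illustration. It picks one representative a per region and collects the distinct labelings of the samples.

```python
def interval_dichotomies(samples):
    """Distinct labelings of `samples` induced by programs `x <= a`, a in [0, 1].

    The program form `x <= a` is an assumed stand-in for the interval programs.
    """
    xs = sorted(set(samples))
    # One representative a per region: the endpoints of [0, 1] and the
    # midpoint between each pair of neighboring samples.
    candidates = [0.0, 1.0] + [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
    return {tuple(x <= a for x in xs) for a in candidates}


generic = interval_dichotomies([0.2, 0.5, 0.9])     # no sample at 0
degenerate = interval_dichotomies([0.0, 0.5, 0.9])  # 0 is in the sample set
```

With three samples strictly inside the interval this yields |S| + 1 = 4 labelings, and with 0 among the samples it yields one fewer, matching the caveat above.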

B Varying Synthetic Problem Parameters
In this section, we provide the complete description of the synthetic benchmarks and present the complete plots of our evaluation. We consider a class of hyperrectangle programs for which we can compute the optimal solution exactly; this lets us compare the results of our implementation to an ideal baseline. Here, the concept class P (i.e., the set of programs) is defined as the set of axis-aligned hyperrectangles within [−1, 1]^d, and the input distribution D is such that inputs are distributed uniformly over [−1, 1]^d. We fix some probability mass b and aim to synthesize a program that is close to a functional specification of the form 0 ≤ x_1 ≤ 2b ∧ ⋀_{i∈{2,...,d}} −1 ≤ x_i ≤ 1, which only returns 1 for points whose first coordinate is positive and at most 2b. We fix the following postcondition: In other words, a correct hyperrectangle must include as much probability mass of points whose first coordinate is negative as it does for those with a positive first coordinate, and additionally it must include at least as much probability mass as the original hyperrectangle. Observe that independent of d, the best error for a correct solution is exactly b (and there exist dense regions of α-robust programs that have error b + α).
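Because the inputs are uniform, masses and errors of hyperrectangles can be computed in closed form, which makes the claim about the best error easy to check on an example. The following sketch is illustrative: all function names are hypothetical, boxes are represented as lists of (lo, hi) pairs, and the postcondition is encoded from its verbal description, reading "as much mass" as "at least as much".

```python
def mass(box):
    """Probability mass of an axis-aligned box under the uniform distribution on [-1, 1]^d."""
    result = 1.0
    for lo, hi in box:
        result *= max(0.0, hi - lo) / 2.0
    return result


def intersection(box_a, box_b):
    """Coordinate-wise intersection; an empty side has lo > hi and contributes zero mass."""
    return [(max(a[0], b[0]), min(a[1], b[1])) for a, b in zip(box_a, box_b)]


def error(box, spec):
    """Mass of the symmetric difference between a candidate box and the specification."""
    return mass(box) + mass(spec) - 2.0 * mass(intersection(box, spec))


def spec_box(d, b):
    """Functional specification: 0 <= x_1 <= 2b, other coordinates unconstrained."""
    return [(0.0, 2.0 * b)] + [(-1.0, 1.0)] * (d - 1)


def satisfies_postcondition(box, d, b, eps=1e-12):
    """Assumed encoding of the verbally described postcondition."""
    rest = [(-1.0, 1.0)] * (d - 1)
    negative = mass(intersection(box, [(-1.0, 0.0)] + rest))
    positive = mass(intersection(box, [(0.0, 1.0)] + rest))
    return negative >= positive - eps and mass(box) >= mass(spec_box(d, b)) - eps


def centered_box(d, b):
    """Candidate solution: -b <= x_1 <= b, other coordinates unconstrained."""
    return [(-b, b)] + [(-1.0, 1.0)] * (d - 1)
```

For instance, `error(centered_box(2, 0.1), spec_box(2, 0.1))` evaluates to 0.1 up to floating-point rounding, and the centered box satisfies the encoded postcondition while the specification itself does not.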
We consider problem instances formed from combinations of d ∈ {1, 2, 3} and b ∈ {0.05, 0.1, 0.2}. As d increases, the set of programs increases in complexity (in fact, it has VC dimension 2d) and the synthesis queries become more expensive. As b increases, the threshold used by the optimization cannot be as small, so we expect the search to benefit less from our optimizations. We run our implementation using thresholds τ ∈ {0.07, 0.15, 0.3, 0.5, 1}, omitting those values for which τ < b; additionally, we also consider an adaptive run where τ is initialized as the value 1, and whenever a new best solution is enumerated with error k we update τ ← k.
Each combination of parameters was run for a period of 2 minutes. Figure 6 shows each of the following as a function of time: (i) the depth completed by the search (i.e. the current size of the sample set), and (ii) the best solution found by the search.
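The adaptive update rule described above can be sketched in a few lines. This is a minimal illustration of the threshold schedule only: the function name and the stream of candidate errors are hypothetical, and the sketch does not model how τ is used to prune synthesis queries.

```python
def adaptive_tau_trace(candidate_errors, tau_init=1.0):
    """Return the value of tau after each enumerated solution.

    tau starts at `tau_init`; whenever a solution with a new best error k
    is enumerated, tau is updated to k.
    """
    tau = tau_init
    best = None
    trace = []
    for err in candidate_errors:
        if best is None or err < best:
            best = err
            tau = err  # tau <- k for the new best error k
        trace.append(tau)
    return trace


# Hypothetical errors of successively enumerated correct solutions.
trace = adaptive_tau_trace([0.4, 0.6, 0.25, 0.3])
```

The threshold only ever decreases, so each improvement tightens the search for the remainder of the run.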

C Thermostat Benchmark
Here we include the specification of our modified version of the thermostat controller synthesis benchmark [7]. Figure 7 shows the definitions of pre, which describes the probability distribution D over the inputs, and thermostat, a program sketch describing the set of possible programs. We handle the thermostat loop (line 11) through syntactic unrolling, since it has a constant bound: Unrollings is the value we instantiate from {5, 10, 20, 40} when creating problem instances for our experiments. Similarly, the threshold N in line 26 is instantiated from {2, 4, 8}.
A synthesized program instantiates the sketch by replacing the holes with real-valued constants: for example, the syntax in the thermostat definition at line 2 specifies that the synthesizer must replace the right side of the assignment with a constant between 0 and 10. The assert statements form the probabilistic postcondition: if we have the set of assert statements in the program {assert(event_i; θ_i);}_{i∈I}, then the postcondition is given by the following conjunction: ⋀_{i∈I} Pr[event_i] > θ_i. (Recall that the loop is syntactically unrolled and observe that all execution paths encounter all assert statements, so this is well-defined.)
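As a rough illustration of how this conjunction is evaluated, the sketch below replaces each probability with an empirical frequency over recorded runs. The function name and the data layout are hypothetical, and frequencies over finitely many runs only estimate the true probabilities, so this is not a substitute for the probabilistic verifier.

```python
def postcondition_holds(event_outcomes, thresholds):
    """Check the conjunction of Pr[event_i] > theta_i using empirical frequencies.

    event_outcomes[j][i] is True when event i occurred on run j;
    thresholds[i] is theta_i.
    """
    n_runs = len(event_outcomes)
    for i, theta in enumerate(thresholds):
        frequency = sum(run[i] for run in event_outcomes) / n_runs
        if not frequency > theta:  # strict inequality, as in the conjunction
            return False
    return True


# Hypothetical outcomes of two assert events over four runs.
runs = [(True, True), (True, False), (True, True), (True, True)]
holds = postcondition_holds(runs, thresholds=[0.9, 0.7])
```

Every run contributes an outcome for every event, which mirrors the observation that all execution paths encounter all assert statements.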