Solving String Constraints Using SAT

. String solvers are automated-reasoning tools that can solve combinatorial problems over formal languages. They typically operate on restricted first-order logic formulas that include operations such as string concatenation, substring relationship, and regular expression matching. String solving thus amounts to deciding the satisfiability of such formulas. While there exists a variety of different string solvers, many string problems cannot be solved efficiently by any of them. We present a new approach to string solving that encodes input problems into propositional logic and leverages incremental SAT solving. We evaluate our approach on a broad set of benchmarks. On the logical fragment that our tool supports, it is competitive with state-of-the-art solvers. Our experiments also demonstrate that an eager SAT-based approach complements existing approaches to string solving in this specific fragment.


Introduction
Many problems in software verification require reasoning about strings. To tackle these problems, numerous string solvers-automated decision procedures for quantifier-free first-order theories of strings and string operations-have been developed over the last years. These solvers form the workhorse of automatedreasoning tools in several domains, including web-application security [19,31,33], software model checking [15], and conformance checking for cloud-access-control policies [2,30].
The general theory of strings relies on deep results in combinatorics on words [23,29,16,5]; unfortunately, the related decision procedures remain intractable in practice. Practical string solvers achieve scalability through a judicious mix of heuristics and restrictions on the language of constraints.
We present a new approach to string solving that relies on an eager reduction to the Boolean satisfiability problem (SAT), using incremental solving and unsatisfiable-core analysis for completeness and scalability. Our approach supports a theory that contains Boolean combinations of regular membership constraints and equality constraints on string variables, and captures a large set of practical queries [6].
Our solving method iteratively searches for satisfying assignments up to a length bound on each string variable; it stops and reports unsatisfiability when the search reaches computed upper bounds without finding a solution. Similar to the solver Woorpje [12], we formulate regular membership constraints as reachability problems in nondeterministic finite automata. By bounding the number of transitions allowed by each automaton, we obtain a finite problem that we encode into propositional logic. To cut down the search space of the underlying SAT problem, we perform an alphabet reduction step (SMT-LIB string constraints are defined over an alphabet of 3 · 2 16 letters and a naive reduction to SAT does not scale). Inspired by bounded model checking [8], we iteratively increase bounds and utilize an incremental SAT solver to solve the resulting series of propositional formulas. We perform an unsatisfiable-core analysis after each unsatisfiable incremental call to increase only the bounds of a minimal subset of variables until a theoretical upper bound is reached.
We have evaluated our solver on a large set of benchmarks. The results show that our SAT-based approach is competitive with state-of-the-art SMT solvers in the logical fragment that we support. It is particularly effective on satisfiable instances.
Closest to our work is the Woorpje solver [12], which also employs an eager reduction to SAT. Woorpje reduces systems of word equations with linear constraints to a single Boolean formula and calls a SAT solver. An extension can also handle regular membership constraints [21]. However, Woorpje does not handle the full language of constraints considered here and does not employ the reduction and incremental solving techniques that make our tool scale in practice. More importantly, in contrast to our solver, Woorpje is not complete-it does not terminate on unsatisfiable instances.
Other solvers such as Hampi [19] and Kaluza [31] encode string problems into constraints on fixed-size bit-vector, which can be solved by reduction to SAT. These tools support expressive constraints but they require a user-provided bound on the length of string variables.
Further from our work are approaches based on the lazy SMT paradigm, which tightly integrates dedicated, heuristic, theory solvers for strings using the CDCL(T) architecture (also called DPLL(T) in early papers). Solvers that follow this paradigm include Ostrich [11], Z3 [25], Z3str4 [24], cvc5 [3], Z3str3RE [7], Trau [1], and CertiStr [17]. Our evaluation shows that our eager approach is competitive with lazy solvers overall, but it also shows that combining both types of solvers in a portfolio is most effective. Our eager approach tends to perform best on satisfiable instances while lazy approaches work better on unsatisfiable problems.

Preliminaries
We assume a fixed alphabet Σ and a fixed set of variables Γ . Words of Σ * are denoted by w, w ′ , w ′′ , etc. Variables are denoted by x, y, z. Our decision procedure supports the theory described in Figure 1. Atoms in this theory include regular membership constraints (or regular constraints for short) of the form x . ∈ RE, where RE is a regular expression, and variable equations of the form x . = y. Concatenation is not allowed in equations.
Regular expressions are defined inductively using union, concatenation, intersection, and the Kleene star. Atomic regular expressions are constant words w ∈ Σ * and the wildcard character ?, which is a placeholder for an arbitrary symbol c ∈ Σ. All regular expressions are grounded, meaning that they do not contain variables. We use the symbols ̸ . ∈ and ̸ . = as a shorthand notation for negations of atoms using the respective predicate symbols. The following is an example formula in our language: ∈ a · b. Using our basic syntax, we can define additional relations, such as constant equations x . = w, and prefix and suffix constraints, written w . ⊑ x and w . ⊒ x, respectively. Even though these relations can be expressed as regular constraints (e.g., the prefix constraint ab . ⊑ x can be expressed as x . ∈ a · b · ? * ), we can generate more efficient reductions to SAT by encoding them explicitly.
This string theory is not as expressive as others, since it does not include string concatenation, but it still has important practical applications. It is used in the Zelkova tool described by Backes, et al. [2] to support analysis of AWS security policies. Zelkova is a major industrial application of SMT solvers [30].
Given a formula ψ, we denote by atoms(ψ) the set of atoms occurring in ψ, by V (ψ) the set of variables occurring in ψ, and by Σ(ψ) the set of constant symbols occurring in ψ. We call Σ(ψ) the alphabet of ψ. Similarly, given a regular expression R, we denote by Σ(R) the set of characters occurring in R. In particular, we have Σ(?) = ∅.
We call a formula conjunctive if it is a conjunction of literals and we call it a clause if it is a disjunction of literals. We say that a formula is in normal form if it is a conjunctive formula without unnegated variable equations. Every conjunctive formula can be turned into normal form by substitution, i.e., by repeatedly rewriting ψ ∧ x . = y to ψ[x := y]. If ψ is in negation normal form (NNF), meaning that the negation symbol occurs only directly in front of atoms, we denote by lits(ψ) the set of literals occurring in ψ. We say that an atom a occurs with positive polarity in ψ if a ∈ lits(ψ) and that it occurs with negative polarity in ψ if ¬a ∈ lits(ψ); we denote the respective sets of atoms of ψ by atoms + (ψ) and atoms − (ψ). The notion of polarity can be extended to arbitrary formulas (not necessarily in NNF), intuitively by considering polarity in a formula's corresponding NNF (see [26] for details). Fig. 2: Overview of the solving process.
The semantics of our language is standard. A regular expression R defines a regular language L(R) over Σ in the usual way. An interpretation is a mapping (also called a substitution) h : Γ → Σ * from string variables to words. Atoms are interpreted as usual, and a model (also called a solution) is an interpretation that makes a formula evaluate to true under the usual semantics of the Boolean connectives.

Overview
Our solving method is illustrated in Figure 2. It first performs three preprocessing steps that generate a Boolean abstraction of the input formula, reduce the size of the input alphabet, and initialize bounds on the lengths of all string variables. After preprocessing, we enter an encode-solve-and-refine loop that iteratively queries a SAT solver with a problem encoding based on the current bounds and refines the bounds after each unsatisfiable solver call. We repeat this loop until either the propositional encoding is satisfiable, in which case we conclude satisfiability of the input formula, or each bound has reached a theoretical upper bound, in which case we conclude unsatisfiability.
Generating the Boolean Abstraction. We abstract the input formula ψ by replacing each theory atom a ∈ atoms(ψ) with a new Boolean variable d(a), and keep track of the mapping between a and d(a). This gives us a Boolean abstraction ψ A of ψ and a set D of definitions, where each definition expresses the relationship between an atom a and its corresponding Boolean variable d(a). If a occurs with only one polarity in ψ, we encode the corresponding definition as an implication, i.e., as d(a) → a or as ¬ d(a) → ¬a, depending on the polarity of a. Otherwise, if a occurs with both polarities, we encode it as an equivalence consisting of both implications. This encoding, which is based on ideas behind the well-known Plaisted-Greenbaum transformation [28], ensures that the formulas ψ and ψ A ∧ d∈D d are equisatisfiable. An example is shown in Figure 3.
Reducing the Alphabet. In the SMT-LIB theory of strings [4], the alphabet Σ comprises 3·2 16 letters, but we can typically use a much smaller alphabet without Fig. 3: Example of Boolean abstraction. The formula ψ, whose expression tree is shown on the left, results in the Boolean abstraction illustrated on the right, where p, q, and r are fresh Boolean variables. We additionally get the definitions ∈ R 2 , and r ↔ z . = w. We use an implication (instead of an equivalence) for atom x . ∈ R 1 since it occurs only with positive polarity within ψ.
affecting satisfiability. In Section 4, we show that using Σ(ψ) and one extra character per string variable is sufficient. Reducing the alphabet is critical for our SAT encoding to be practical.
Initializing Bounds. A model for the original first-order formula ψ is a substitution h : Γ → Σ * that maps each string variable to a word of arbitrary length such that ψ evaluates to true. As we use a SAT solver to find such substitutions, we need to bound the lengths of strings, which we do by defining a bound function b : Γ → N that assigns an upper bound to each string variable. We initialize a small upper bound for each variable, relying on simple heuristics. If the bounds are too small, we increase them in a later refinement step.
Encoding, Solving, and Refining Bounds. Given a bound function b, we build a propositional formula ψ b that is satisfiable if and only if the original formula ψ has a solution h such that |h(x)| ≤ b(x) for all x ∈ Γ . We encode ψ b as the is an encoding of the definitions D, and h b is an encoding of the set of possible substitutions. We discuss details of the encoding in Section 5. A key property is that it relies on incremental SAT solving under assumptions [13]. Increasing bounds amounts to adding new clauses to the formula ψ b and fixing a set of assumptions, i.e., temporarily fixing the truth values of a set of Boolean variables. If ψ b is satisfiable, we can construct a substitution h from a Boolean model Otherwise, we examine an unsatisfiable core (i.e., an unsatisfiable subformula) of ψ b to determine whether increasing the bounds may give a solution and, if so, to identify the variables whose bounds must be increased. In Section 6, we explain in detail how we analyze unsatisfiable cores, increase bounds, and conclude unsatisfiability.

Reducing the Alphabet
In many applications, the alphabet Σ is large-typically Unicode or an approximation of Unicode as defined in the SMT-LIB standard-but formulas use much fewer symbols (less than 100 symbols is common in our experiments). In order to check the satisfiability of a formula ψ, we can restrict the alphabet to the symbols that occur in ψ and add one extra character per variable. This allows us to produce compact propositional encodings that can be solved efficiently in practice.
To prove that such a reduced alphabet A is sufficient, we show that a model h : Γ → Σ * of ψ can be transformed into a model h ′ : Γ → A * of ψ by replacing characters of Σ that do not occur in ψ by new symbols-one new symbol per variable of ψ.
More generally, assume B is a subset of Σ and n is a positive integer such that |B| ≤ |Σ| − n. We can then pick n distinct symbols . This construction satisfies the following property: . . , f n be mappings as defined above, and let i, j ∈ 1, . . . , n such that i ̸ = j. Then, the following holds: Proof. The first part is an easy case analysis. For the second part, we have that |f i (w)| = |w| and |f j (w ′ )| = |w ′ |, so the statement holds if w and w ′ have different lengths. Assume now that w and w ′ have the same length and let v be the longest common prefix of w and w ′ . Since w and w ′ are distinct, we have that w = v · a · u and w ′ = v · b · u ′ , where a ̸ = b are symbols of Σ and u and u ′ are words of Σ * . By the first part, we have The following lemma can be proved by induction on R.
. . , f n be mappings as defined above and let R be a regular expression with Σ(R) ⊆ B. Then, for all words w ∈ Σ * and all i ∈ 1, . . . , n,

Given a subset
We can now prove the main theorem of this section, which shows how to reduce the alphabet while maintaining satisfiability. Proof. We set B = Σ(ψ) and use the previous construction. So the alphabet We can assume that ψ is in disjunctive normal form, meaning that it is a disjunction of the form ψ = ψ 1 ∨ · · · ∨ ψ m , where each ψ t is a conjunctive formula. If ψ is satisfiable, then one of the disjuncts ψ k is satisfiable and we have Σ(ψ k ) ⊆ B. We can turn ψ k into normal form by eliminating all variable equalities of the form We have three cases: The reduction presented here can be improved and generalized. For example, it can be worthwhile to use different alphabets for different variables or to reduce large character intervals to smaller sets.

Propositional Encodings
Our algorithm performs a series of calls to a SAT solver. Each call determines the satisfiability of the propositional encoding ψ b of ψ for some upper bounds b.
is an encoding of the set of possible substitutions, and D b is an encoding of the theory-literal definitions, both bounded by b. Intuitively, h b tells the SAT solver to "guess" a substitution, D b makes sure that all theory literals are assigned proper truth values according to the substitution, and ψ A forces the evaluation of the whole formula under these truth values.
Suppose the algorithm performs n calls and let b k : Γ → N for k ∈ 1, . . . , n denote the upper bounds used in the k-th call to the SAT solver. For convenience, we additionally define b 0 (x) = 0 for all x ∈ Γ . In the k-th call, the SAT solver decides whether ψ b k is satisfiable. The Boolean abstraction ψ A , which we already discussed in Section 3, stays the same for each call. In the following, we thus discuss the encodings of the substitutions h b k and of the various theory literals a b k and ¬a b k that are part of D b k . Even though SAT solvers expect their input in CNF, we do not present the encodings in CNF to simplify the presentation, but they can be converted to CNF using simple equivalence transformations.
Most of our encodings are incremental in the sense that the formula for call k is constructed by only adding clauses to the formula for call k − 1. In other words, for substitution encodings we have In these cases, it is thus enough to encode the incremental additions for each call to the SAT solver. Some of our encodings, however, introduce clauses that are valid only for a specific bound b k and thus become invalid for larger bounds. We handle the deactivation of these encodings with selector variables as is common in incremental SAT solving.
Our encodings are correct in the following sense.

Substitutions
We encode substitutions by defining for each variable x ∈ Γ the characters to which each of x's positions is mapped. Specifically, given x and its corresponding upper bound b(x), we represent the substitution h(x) by introducing new vari- . We call these variables filler variables and we denote the set of all filler variables byΓ . By introducing a new symbol λ ̸ ∈ Σ, which stands for an unused filler variable, we can define h based on a substitutionȟ :Γ → Σ λ over the filler variables, where We use this representation of substitutions (known as "filling the positions" [18]) because it has a straightforward propositional encoding: For each variable x ∈ Γ and each position i ∈ 1, . . . , b(x), we create a set {h a We then use a propositional encoding of an exactly-one (EO) constraint (e.g., [20]) to assert that exactly one variable in this set must be true: Constraint (2) prevents the SAT solver from considering filled substitutions that are equivalent modulo λ-substitutions-it enforces that if a position i is mapped to λ, all following positions are mapped to λ too. For instance, abλλ, aλbλ, and λλab all correspond to the same word ab, but our encoding allows only abλλ. Thus, every Boolean assignment ω that satisfies h b encodes exactly one substitution h ω , and for every substitution h (bounded by b) there exists a corresponding assignment ω h that satisfies h b .

Theory Literals
The only theory literals of our core language are regular constraints (x . ∈ R) and variable equations (x . = y) with their negations. Constant equations (x . = w) as well as prefix and suffix constraints (w . ⊑ x and w . ⊒ x) could be expressed as regular constraints, but we encode them explicitly to improve performance.

Regular Constraints
We encode a regular constraint x . ∈ R by constructing a propositional formula that is true if and only if the word h(x) is accepted by a specific nondeterministic finite automaton that accepts the language L(R). Let x . ∈ R be a regular constraint and let M = (Q, Σ, δ, q 0 , F ) be a nondeterministic finite automaton (with states Q, alphabet Σ, transition relation δ, initial state q 0 , and accepting states F ) that accepts L(R) and that additionally allows λ-selftransitions on every state. Given that λ is a placeholder for the empty symbol, λ-transitions do not change the language accepted by M . We allow them so that M performs exactly b(x) transitions, even for substitutions of length less than b(x). This reduces checking whether the automaton accepts a word to only evaluating the states reached after exactly b(x) transitions. Given The formula captures all possible forward moves from each state. We must also ensure that a state is reachable only if it has a reachable predecessor, which we encode with the following formula, where pred(q ′ ) = {(q, a) | q ′ ∈ δ(q, a)}: The formula states that if state q ′ is reachable after i ≥ 1 transitions, then there must be a reachable predecessor state q ∈δ To decide whether the automaton accepts h ω (x), we encode that it must reach an accepting state after b k (x) transitions. Our corresponding encoding is only valid for the particular bound b k (x). To account for this, we introduce a fresh selector variable s k and define accept x . In the k-th call to the SAT solver and all following calls with the same bound on x, we solve under the assumption that s k is true. In the first call k ′ with b k (x) < b k ′ (x), we re-encode the condition using a new selector variable s k ′ and solve under the assumption that s k is false and s ′ k is true. The full encoding of the regular constraint x . 
∈ R is thus given by Variable Equations Let x, y ∈ Γ be two string variables, let l = min(b k−1 (x), b k−1 (y)), and let u = min(b k (x), b k (y)). We encode equality between x and y with respect to b k position-wise up to u: The formula asserts that for each position i ∈ l + 1, . . . , u, if x[i] is mapped to a symbol, then y[i] is mapped to the same symbol (including λ). Since our encoding of substitutions ensures that every position in a string variable is mapped to exactly one character, = y, we encode that h(x) and h(y) must disagree on at least one position, which can happen either because they map to different symbols or because the variable with the higher bound is mapped to a longer word. As for the regular constraints, we again use selector variable s k to deactivate the encoding for all later bounds, for which it will be re-encoded:

Constant Equations Given a constant equation x .
= w, if the upper bound of x is less than |w|, the atom is trivially unsatisfiable. Thus, for all i such that b i (x) < |w|, we encode x . = w with a simple literal ¬s x,w and add s x,w to the assumptions. For b k (x) ≥ |w|, the encoding is based on the value of b k−1 (x): , then only the empty suffix has to be ensured.
Conversely, for an inequality x ̸ . = w, if b k (x) < |w|, then any substitution trivially is a solution, which we simply encode with ⊤. Otherwise, we introduce a selector variable s ′ x,w and define

Prefix and Suffix Constraints A prefix constraint w
. ⊑ x expresses that the first |w| positions of x must be mapped exactly onto w. As with equations between a variable x and a constant word w, we could express this as a regular constraint of the form x . ∈ w·? * . However, we achieve a more efficient encoding simply by dropping from the encoding of x . = w the assertion that the suffix of x starting at |w + 1| be empty. Accordingly, a negated prefix constraint w ̸ ∈ R that occur in φ. We can characterize the solutions to all these constraints by a single nondeterministic finite automaton M i . If the constraints on where L(R) denotes the complement of L(R). We say that M i accepts the regular constraints on x i in φ. If there are no such constraints on x i , then M i is the one-state NFA that accepts the full language Σ * . Let Q i denote the set of states of M i . If we do not take inequalities into account and if the regular constraints on x i are satisfiable, then a shortest solution h has length |h(x i )| ≤ |Q i |.
Theorem 6.1 gives a bound for the general case with variable inequalities. Intuitively, we prove the theorem by constructing a single automaton P that takes as input a vector of words W = (w 1 , ..., w n ) T and accepts W iff the substitution h W with h W (x i ) = w i satisfies φ. To construct P, we introduce one two-state NFA for each inequality and we then form the product of these NFAs with (slightly modified versions of) the NFAs M 1 , . . . , M n . We can then derive the bound of a shortest solution from the number of states of P. Theorem 6.1. Let φ be a conjunctive formula in normal form over variables be an NFA that accepts the regular constraints on x i in φ and let k be the number of inequalities occurring in φ. If φ is satisfiable, then it has a model h such that Proof. Let λ be a symbol that does not belong to Σ and define Σ λ = Σ ∪{λ}. As previously, we use λ to extend words of Σ * by padding. Given a word w ∈ Σ * λ , we denote byŵ the word of Σ * obtained by removing all occurrences of λ from w. We say that w is well-formed if it can be written as w = v · λ t with v ∈ Σ * and t ≥ 0. In this case, we haveŵ = v. Thus a well-formed word w consists of a prefix in Σ * followed by a sequence of λs.
Let ∆ be the alphabet Σ n λ , i.e., the letters of ∆ are the n-letter words over Σ λ . We can then represent a letter u of ∆ as an n-element vector (u 1 , . . . , u n ), and a word W of ∆ t can be written as an n × t matrix where u ij ∈ Σ λ . Each column of this matrix is a letter in ∆ and each row is a word in Σ t λ . We denote by p i (W ) the i-th row of this matrix and byp i (W ) = p i (W ) the word p i (W ) with all occurrences of λ removed. We say that W is well-formed if the words p 1 (W ), . . . , p n (W ) are all well-formed. Given a well-formed word W , we can construct a mapping h W : To prove the theorem, we build an NFA P with alphabet ∆ such that a wellformed word W is accepted by P iff h W satisfies φ. The shortest well-formed W accepted by P has length no more than the number of states of P and the bound will follow.
We first extend the NFA i has the same set of states, initial state, and final states as M i . Its transition relation δ ′ i is defined by One can easily check that M ′ i accepts a word W iff M i acceptsp i (W ). For an inequality x i ̸ . = x j , we construct an NFA D i,j = ({e, d}, ∆, δ, e, {d}) with transition function defined as follows: This NFA has two states. It starts in state e (for "equal") and stays in e as long as the characters u i and u j are equal. It transitions to state d (for "different") on the first u where u i ̸ = u j and stays in state d from that point. Since d is the final state, a word W is accepted by We define P to be the product of the NFAs M ′ 1 , . . . , M ′ n and D i1,j1 , . . . , D i k ,j k . A well-formed word W is accepted by P if it is accepted by all M ′ i and all D it,jt , which means that P accepts a well-formed word W iff h W satisfies φ.
Let P be the set of states of P. We then have |P | ≤ 2 k × |Q 1 | × . . . × |Q n |. Assume φ is satisfiable, so P accepts a well-formed word W . The shortest wellformed word accepted by P has an accepting run that does not visit the same state twice. So the length of this well-formed word W is no more than |P |. The mapping h W satisfies φ and for every The bound given by Theorem 6.1 holds if φ is in normal form but it also holds for a general conjunctive formula ψ. This follows from the observation that converting conjunctive formulas to normal form preserves the length of solutions.
In particular, we convert ψ ∧ x . = y to formula ψ ′ = ψ[x := y] so x does not occur in ψ ′ , but clearly, a bound for y in ψ ′ gives us the same bound for x in ψ.
In practice, before we apply the theorem we decompose the conjunctive formula φ into subformulas that have disjoint sets of variables. We write φ as φ 1 ∧ . . . ∧ φ m where the conjuncts have no common variables. Then, φ is satisfiable if each conjunct φ t is satisfiable and we derive upper bounds on the shortest solution for the variables of φ t , which gives more precise bounds than deriving bounds from φ directly. In particular, if a variable x i of ψ does not occur in any inequality, then the bound on |h(x i )| is |Q i |.
Theorem 6.1 only holds for conjunctive formulas. For an arbitrary (nonconjunctive) formula ψ, a generalization is to convert ψ into disjunctive normal form. Alternatively, it is sufficient to enumerate the subsets of lits(ψ). Given a subset A of lits(ψ), let us denote by d A a mapping that bounds the length of solutions to A, i.e., any solution h to A satisfies |h(x)| ≤ d A (x). This mapping d A can be computed from Theorem 6.1. The following property gives a bound for ψ.

Proposition 6.2. If ψ is satisfiable, then it has a model h such that for all
Proof. We can assume that ψ is in negation normal form. We can then convert ψ to disjunctive normal form ψ ⇔ ψ 1 ∨ · · · ∨ ψ n and we have lits(ψ i ) ⊆ lits(ψ). Also, ψ is satisfiable if and only if at least one ψ i is satisfiable and the proposition follows.
⊓ ⊔ Since there are 2 |lits(ψ)| subsets of lits(ψ), a direct application of Proposition 6.2 is rarely feasible in practice. Fortunately, we can use unsatisfiable cores to reduce the number of subsets to consider.

Unsatisfiable-Core Analysis
Instead of calculating the bounds upfront, we use the unsatisfiable core produced by the SAT solver after each incremental call to evaluate whether the upper bounds on the variables exceed the upper bounds of the shortest solution. If ψ b is unsatisfiable for bounds b, then it has an unsatisfiable core Cā with (possibly empty) subsets of clauses and ¬ d(a) → ¬a b to be in CNF. Let C + = {a | C a ̸ = ∅} and C − = {¬a | Cā ̸ = ∅} be the sets of literals whose encodings contain at least one clause of the core C. Using these sets, we construct the formula which consists of the conjunction of the abstraction and the definitions of the literals that are contained in C + , respectively C − . Recall that ψ is equisatisfiable to the conjunction ψ A ∧ d∈D d of the abstraction and all definitions in D. Let ψ ′ denote this formula, i.e., The following proposition shows that it suffices to refine the bounds according to ψ C . Proposition 6.3. Let ψ be unsatisfiable with respect to b and let C be an unsatisfiable core of ψ b . Then, ψ C is unsatisfiable with respect to b and ψ ′ |= ψ C .
Proof. By definition, we have We also have ψ ′ |= ψ C since C + ⊆ atoms + (ψ) and C − ⊆ atoms − (ψ). ⊓ ⊔ Applying Proposition 6.2 to ψ C results in the upper bounds of the shortest solution h C for ψ C . If |h C (x)| ≤ b(x) holds for all x ∈ Γ , then ψ C has no solution and unsatisfiability of ψ ′ follows from Proposition 6.3. Because ψ and ψ ′ are equisatisfiable, we can conclude that ψ is unsatisfiable. Otherwise, we increase the bounds on the variables that occur in ψ C while keeping bounds on the other variables unchanged: We construct b k+1 with b k (x) ≤ b k+1 (x) ≤ |h C (x)| for all x ∈ Γ , such that b k (y) < b k+1 (y) holds for at least one y ∈ V (ψ C ). By strictly increasing at least one variable's bound, we eventually either reach the upper bounds of ψ C and return unsatisfiability, or we eliminate it as an unsatisfiable implication of ψ. As there are only finitely many possibilities for C and thus for ψ C , our procedure is guaranteed to terminate.
We do not explicitly construct formula ψ C to compute bounds on h C as we know the set lits(ψ C ) = C + ∪ C − . Finding upper bounds still requires enumerating all subsets of lits(ψ C ), but we have |lits(ψ C )| ≤ |lits(ψ)| and usually lits(ψ C ) is much smaller than lits(ψ). For example, consider the formula ∈ ab·? * which is unsatisfiable for the bounds b(x) = b(y) = 1 and b(z) = 4.
The unsatisfiable core C returned after solving ψ b results in the formula ∈ ab·? * containing four literals. Finding upper bounds for ψ C thus amounts to enumerating just 2 4 subsets, which is substantially less than considering all 2 7 subsets of lits(ψ) upfront. The conjunction of a subset of lits(ψ C ) yielding the largest upper bounds is x ∈ ab·? * , which simplifies to x . ∈ ab * ∩ ab·? * and has a solution of length at most 2 for x and y. With bounds b(x) = b(y) = 2 and b(z) = 4, the formula is satisfiable.

Implementation
We have implemented our approach in a solver called nfa2sat. nfa2sat is written in Rust and uses CaDiCaL [9] as the backend SAT solver. We use the incremental API provided by CaDiCaL to solve problems under assumptions. Soundness of nfa2sat follows from Theorem 5.1. For completeness, we rely on CaDiCaL's failed function to efficiently determine failed assumptions, i.e., assumption literals that were used to conclude unsatisfiability.
The procedure works as follows. Given a formula ψ, we first introduce one fresh Boolean selector variable s l for each theory literal l ∈ lits(ψ). Then, instead of adding the encoded definitions of the theory literals directly to the SAT solver, we precede them with their corresponding selector variables: for a positive literal a, we add s a → (d(a) → a ), and for a negative literal ¬a, we add s ¬a → (¬ d(a) → ¬a ) (considering assumptions introduced by a as unit clauses). In the resulting CNF formula, the new selector variables are present in all clauses that encode their corresponding definition, and we use them as assumptions for every incremental call to the SAT solver, which does not affect satisfiability. If such an assumption failed, then we know that at least one of the corresponding clauses in the propositional formula was part of an unsatisfiable core, which enables us to efficiently construct the sets C + and C − of positive and negative atoms present in the unsatisfiable core. As noted previously, we have lits(ψ C ) = C + ∪ C − and hence the sets are sufficient to find bounds on a shortest model for ψ C .
This approach is efficient for obtaining lits(ψ_C), but since CaDiCaL does not guarantee that the set of failed assumptions is minimal, lits(ψ_C) is not minimal in general. Moreover, even a minimal lits(ψ_C) can contain too many elements to process all subsets. To address this issue, we enumerate the subsets only if lits(ψ_C) is small (by default, we use a limit of ten literals). In this case, we construct the automata M_i used in Theorem 6.1 for each subset, applying the techniques described in [7] to quickly rule out unsatisfiable subsets. Otherwise, instead of enumerating the subsets, we resort to sound approximations of the upper bounds, which amounts to over-approximating the number of states without explicitly constructing the automata (cf. [14]).
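The size-gated strategy can be summarized as follows. This is a minimal sketch with hypothetical helper names: exact_bound and approx_bound stand in for the automata-based computation of Theorem 6.1 and the state-count over-approximation, respectively.

```python
from itertools import chain, combinations

SUBSET_LIMIT = 10  # the default limit of ten literals mentioned above

def bound_for(core_literals, exact_bound, approx_bound):
    """Enumerate all non-empty subsets of the core only when the core
    is small; otherwise fall back to a sound over-approximation."""
    if len(core_literals) <= SUBSET_LIMIT:
        subsets = chain.from_iterable(
            combinations(core_literals, k)
            for k in range(1, len(core_literals) + 1))
        # Keep the largest bound so the result is valid for every subset.
        return max(exact_bound(s) for s in subsets)
    return approx_bound(core_literals)

# Toy stand-ins: the "exact" bound is the subset size, the
# over-approximation always returns a coarse constant.
small_core = ["l1", "l2", "l3"]
big_core = [f"l{i}" for i in range(20)]
print(bound_for(small_core, len, lambda c: 99))  # 3 (exhaustive)
print(bound_for(big_core, len, lambda c: 99))    # 99 (approximation)
```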
Once we have obtained upper bounds on the length of a solution of ψ_C, we increment the bounds of all variables involved, except those that have already reached their maximum. Our default heuristic computes a new bound that is either double the variable's current bound or its maximum, whichever is smaller.
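The doubling heuristic amounts to the following one-liner (a minimal illustration, not nfa2sat's actual code):

```python
def next_bound(current, maximum):
    """Default heuristic described above: double the current bound,
    capped at the variable's maximum."""
    return min(2 * current, maximum)

# A variable whose bound grows toward its maximum of 10:
b = 2
while b < 10:
    b = next_bound(b, 10)
    print(b)  # prints 4, then 8, then 10
```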

Experimental Evaluation
We have evaluated our solver on a large set of benchmarks from the ZaligVinder [22] repository. The repository contains 120,287 benchmarks stemming from both academic and industrial applications; in particular, it includes all the string problems from the SMT-LIB repository. We converted the ZaligVinder problems to the SMT-LIB 2.6 syntax and removed duplicates. This resulted in 82,632 unique problems, out of which 29,599 are in the logical fragment we support. We compare nfa2sat with the state-of-the-art solvers cvc5 (version 1.0.3) and Z3 (version 4.12.0). The comparison is limited to these two solvers because they are widely adopted and because they had the best performance in our evaluation. Other string solvers either do not support our logical fragment (CertiStr, Woorpje) or gave incorrect answers on the benchmark problems considered here. Older, no-longer-maintained solvers have known soundness problems, as reported in [7] and [27].
We ran our experiments on a Linux server, with a timeout of 1200 seconds of CPU time and a memory limit of 16 GB. Table 1 shows the results. As a single tool, nfa2sat solves more problems than cvc5 but not as many as Z3. All three tools solve more than 98% of the problems.
The table also shows the results of portfolios that combine two solvers. The best portfolio configuration uses both Z3 and nfa2sat: it solves all but 20 problems within the timeout and reduces the total runtime from 283,942 seconds for Z3 alone (about 79 hours) to 28,914 seconds (about 8 hours), a 90% reduction in total solve time. The other two portfolios, Z3 with cvc5 and nfa2sat with cvc5, also perform better than a single solver, but the improvement in runtime and number of timeouts is not as large.
Figure 4a illustrates why nfa2sat and Z3 complement each other well. It shows three scatter plots that compare the runtimes of nfa2sat and Z3 on our problems: the plot on the left includes all problems, the one in the middle only satisfiable problems, and the one on the right only unsatisfiable problems. The lines marked "failed" correspond to problems that were not solved because a solver ran out of memory; the lines marked "timeout" correspond to problems not solved within the 1200-second timeout. Points in the left plot are concentrated close to the axes, with a smaller number of points near the diagonal, meaning that Z3 and nfa2sat have different runtimes on most problems. The other two plots show this even more clearly: nfa2sat is faster on satisfiable problems, while Z3 is faster on unsatisfiable problems. Figure 4b shows analogous scatter plots comparing nfa2sat and cvc5. The two solvers show similar performance on a large set of easy benchmarks, although cvc5 is faster on problems that both solvers can solve in less than 1 second. However, cvc5 times out on 38 problems that nfa2sat solves in less than 2 seconds.
On unsatisfiable problems, cvc5 tends to be faster than nfa2sat, but there is a class of problems for which nfa2sat takes between 10 and 100 seconds whereas cvc5 is slower.
Overall, the comparison shows that nfa2sat is competitive with cvc5 and Z3 on these benchmarks. We also observe that nfa2sat tends to work better on satisfiable problems. For best overall performance, our experiments show that a portfolio of Z3 and nfa2sat would solve all but 20 problems within the timeout, and reduce the total solve time by 90%.

Conclusion
We have presented the first eager SAT-based approach to string solving that is both sound and complete for a reasonably expressive fragment of string theory. Our experimental evaluation shows that our approach is competitive with the state-of-the-art lazy SMT solvers Z3 and cvc5, outperforming them on satisfiable problems but falling behind on unsatisfiable ones. A portfolio that combines our approach with these solvers-particularly with Z3-would thus yield strong performance across both types of problems.
In future work, we plan to extend our approach to a more expressive logical fragment, including more general word equations. Other avenues of research include the adaptation of model-checking techniques such as IC3 [10] to string problems, which we hope will lead to better performance on unsatisfiable instances. A particular benefit of the eager approach is that it enables the use of mature techniques from the SAT world, especially for proof generation and parallel solving. Producing proofs of unsatisfiability is complex for traditional CDCL(T) solvers because of the complex rewriting and deduction rules they employ. In contrast, efficiently generating and checking proofs produced by SAT solvers (using the DRAT format [32]) is well-established and practical. A challenge in this respect would be to combine unsatisfiability proofs from a SAT solver with a proof that our reduction to SAT is sound. For parallel solving, we plan to explore the use of a parallel incremental solver (such as iLingeling [9]) as well as other possible ways to solve multiple bounds in parallel.