1 Introduction

Many problems in software verification require reasoning about strings. To tackle these problems, numerous string solvers—automated decision procedures for quantifier-free first-order theories of strings and string operations—have been developed in recent years. These solvers form the workhorse of automated-reasoning tools in several domains, including web-application security [19, 31, 33], software model checking [15], and conformance checking for cloud-access-control policies [2, 30].

The general theory of strings relies on deep results in combinatorics on words [5, 16, 23, 29]; unfortunately, the related decision procedures remain intractable in practice. Practical string solvers achieve scalability through a judicious mix of heuristics and restrictions on the language of constraints.

We present a new approach to string solving that relies on an eager reduction to the Boolean satisfiability problem (SAT), using incremental solving and unsatisfiable-core analysis for completeness and scalability. Our approach supports a theory that contains Boolean combinations of regular membership constraints and equality constraints on string variables, and captures a large set of practical queries [6].

Our solving method iteratively searches for satisfying assignments up to a length bound on each string variable; it stops and reports unsatisfiability when the search reaches computed upper bounds without finding a solution. Similar to the solver Woorpje [12], we formulate regular membership constraints as reachability problems in nondeterministic finite automata. By bounding the number of transitions allowed by each automaton, we obtain a finite problem that we encode into propositional logic. To cut down the search space of the underlying SAT problem, we perform an alphabet reduction step (SMT-LIB string constraints are defined over an alphabet of \(3\cdot 2^{16}\) letters and a naive reduction to SAT does not scale). Inspired by bounded model checking [8], we iteratively increase bounds and utilize an incremental SAT solver to solve the resulting series of propositional formulas. We perform an unsatisfiable-core analysis after each unsatisfiable incremental call to increase only the bounds of a minimal subset of variables until a theoretical upper bound is reached.

We have evaluated our solver on a large set of benchmarks. The results show that our SAT-based approach is competitive with state-of-the-art SMT solvers in the logical fragment that we support. It is particularly effective on satisfiable instances.

Closest to our work is the Woorpje solver [12], which also employs an eager reduction to SAT. Woorpje reduces systems of word equations with linear constraints to a single Boolean formula and calls a SAT solver. An extension can also handle regular membership constraints [21]. However, Woorpje does not handle the full language of constraints considered here and does not employ the reduction and incremental solving techniques that make our tool scale in practice. More importantly, in contrast to our solver, Woorpje is not complete—it does not terminate on unsatisfiable instances.

Other solvers such as Hampi [19] and Kaluza [31] encode string problems into constraints on fixed-size bit-vectors, which can be solved by reduction to SAT. These tools support expressive constraints, but they require a user-provided bound on the length of string variables.

Further from our work are approaches based on the lazy SMT paradigm, which tightly integrates dedicated, heuristic, theory solvers for strings using the CDCL(T) architecture (also called DPLL(T) in early papers). Solvers that follow this paradigm include Ostrich [11], Z3 [25], Z3str4 [24], cvc5 [3], Z3str3RE [7], Trau [1], and CertiStr [17]. Our evaluation shows that our eager approach is competitive with lazy solvers overall, but it also shows that combining both types of solvers in a portfolio is most effective. Our eager approach tends to perform best on satisfiable instances while lazy approaches work better on unsatisfiable problems.

2 Preliminaries

We assume a fixed alphabet \(\varSigma \) and a fixed set of variables \(\varGamma \). Words of \(\varSigma ^*\) are denoted by w, \(w'\), \(w''\), etc. Variables are denoted by \(\textsf{x}, \textsf{y}, \textsf{z}\). Our decision procedure supports the theory described in Fig. 1.

Fig. 1. Syntax: \(\textsf{x}\) and \(\textsf{y}\) denote string variables and w denotes a word of \(\varSigma ^*\). The symbol \(\text {?}\) is the wildcard character.

Atoms in this theory include regular membership constraints (or regular constraints for short) of the form \(\textsf{x}\overset{.}{\in }RE\), where RE is a regular expression, and variable equations of the form \(\textsf{x} \doteq \textsf{y}\). Concatenation is not allowed in equations.

Regular expressions are defined inductively using union, concatenation, intersection, and the Kleene star. Atomic regular expressions are constant words \(w\in \varSigma ^*\) and the wildcard character \(\text {?}\), which is a placeholder for an arbitrary symbol \(c \in \varSigma \). All regular expressions are grounded, meaning that they do not contain variables. We use the symbols \(\overset{.}{\notin }\) and \(\not \doteq \) as shorthand notation for negations of atoms using the respective predicate symbols. The following is an example formula in our language: \(\textsf{x} \overset{.}{\in }\text {a} \cdot \text {?}^* \wedge (\textsf{x} \not \doteq \textsf{y} \vee \textsf{y} \overset{.}{\notin }\text {b}^*)\).

Using our basic syntax, we can define additional relations, such as constant equations \(\textsf{x} \doteq w\), and prefix and suffix constraints, written \(w \overset{.}{\sqsubseteq }\textsf{x}\) and \(w \overset{.}{\sqsupseteq }\textsf{x}\), respectively. Even though these relations can be expressed as regular constraints (e.g., the prefix constraint \(\text {ab} \overset{.}{\sqsubseteq }\textsf{x}\) can be expressed as \(\textsf{x} \overset{.}{\in }\text {a} \cdot \text {b} \cdot \text {?}^*\)), we can generate more efficient reductions to SAT by encoding them explicitly.
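To illustrate why the explicit encodings can be cheaper, note that a prefix constraint \(w \overset{.}{\sqsubseteq }\textsf{x}\) only needs to pin down the first |w| positions of \(\textsf{x}\). The following sketch is our own illustration, not the tool's actual encoding: it assumes a character-per-position representation of \(\textsf{x}\) (one Boolean variable per position and character, as used later in our encodings, numbered here by a hypothetical `hvar` function) and emits one unit clause per character of w.

```python
# Hypothetical sketch of an explicit prefix-constraint encoding. hvar(i, a)
# is an assumed mapping from "position i of x holds character a" to a
# DIMACS variable id; the constraint w is-a-prefix-of x then becomes one
# unit clause per character of w, instead of an automaton for w . ?*.

def encode_prefix(w, hvar):
    """Unit clauses fixing the first |w| positions of x to the word w."""
    return [[hvar(i + 1, a)] for i, a in enumerate(w)]
```

Compared with unrolling an automaton for \(\text {a} \cdot \text {b} \cdot \text {?}^*\), this produces no state variables at all, which is the kind of saving the explicit encodings aim for.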

This string theory is not as expressive as others, since it does not include string concatenation, but it still has important practical applications. It is used in the Zelkova tool described by Backes et al. [2] to support analysis of AWS security policies. Zelkova is a major industrial application of SMT solvers [30].

Given a formula \(\psi \), we denote by \({{\,\mathrm{\textit{atoms}}\,}}(\psi )\) the set of atoms occurring in \(\psi \), by \(V(\psi )\) the set of variables occurring in \(\psi \), and by \(\varSigma (\psi )\) the set of constant symbols occurring in \(\psi \). We call \(\varSigma (\psi )\) the alphabet of \(\psi \). Similarly, given a regular expression R, we denote by \(\varSigma (R)\) the set of characters occurring in R. In particular, we have \(\varSigma (?) = \emptyset \).

We call a formula conjunctive if it is a conjunction of literals and we call it a clause if it is a disjunction of literals. We say that a formula is in normal form if it is a conjunctive formula without unnegated variable equations. Every conjunctive formula can be turned into normal form by substitution, i.e., by repeatedly rewriting \(\psi \wedge \textsf{x} \doteq \textsf{y}\) to \(\psi [\textsf{x} := \textsf{y}]\). If \(\psi \) is in negation normal form (NNF), meaning that the negation symbol occurs only directly in front of atoms, we denote by \({{\,\mathrm{\textit{lits}}\,}}(\psi )\) the set of literals occurring in \(\psi \). We say that an atom a occurs with positive polarity in \(\psi \) if \(a \in {{\,\mathrm{\textit{lits}}\,}}(\psi )\) and that it occurs with negative polarity in \(\psi \) if \(\lnot a \in {{\,\mathrm{\textit{lits}}\,}}(\psi )\); we denote the respective sets of atoms of \(\psi \) by \({{\,\mathrm{\textit{atoms}}\,}}^+(\psi )\) and \({{\,\mathrm{\textit{atoms}}\,}}^-(\psi )\). The notion of polarity can be extended to arbitrary formulas (not necessarily in NNF), intuitively by considering polarity in a formula’s corresponding NNF (see [26] for details).
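The normal-form rewriting described above can be sketched directly. The following is a minimal illustration (the tuple representation and function names are our own assumptions, not the tool's code): literals are tuples, \(( \texttt{'eq'}, \textsf{x}, \textsf{y}, \texttt{True})\) stands for \(\textsf{x} \doteq \textsf{y}\) and \(\texttt{False}\) for its negation, and each unnegated variable equation is eliminated by substitution.

```python
# Minimal sketch (assumed representation) of normalization by substitution:
# repeatedly rewrite psi /\ x = y into psi[x := y] until no unnegated
# variable equation remains.

def substitute(lit, old, new):
    """Rename variable `old` to `new` inside a literal."""
    kind, var, arg, pol = lit
    var = new if var == old else var
    if kind == 'eq':                      # equations mention two variables
        arg = new if arg == old else arg
    return (kind, var, arg, pol)

def normal_form(literals):
    """Eliminate all unnegated variable equations from a conjunction."""
    literals = list(literals)
    while True:
        eq = next((l for l in literals
                   if l[0] == 'eq' and l[3] and l[1] != l[2]), None)
        if eq is None:
            # drop trivial equations x = x produced by substitution
            return [l for l in literals if not (l[0] == 'eq' and l[3])]
        _, x, y, _ = eq
        literals.remove(eq)
        literals = [substitute(l, x, y) for l in literals]
```

For instance, normalizing \(\textsf{x} \doteq \textsf{y} \wedge \textsf{x} \overset{.}{\in }R_1 \wedge \textsf{x} \not \doteq \textsf{z}\) yields \(\textsf{y} \overset{.}{\in }R_1 \wedge \textsf{y} \not \doteq \textsf{z}\).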

The semantics of our language is standard. A regular expression R defines a regular language \(\mathcal {L}(R)\) over \(\varSigma \) in the usual way. An interpretation is a mapping (also called a substitution) \(h :\varGamma \rightarrow \varSigma ^*\) from string variables to words. Atoms are interpreted as usual, and a model (also called a solution) is an interpretation that makes a formula evaluate to true under the usual semantics of the Boolean connectives.

3 Overview

Our solving method is illustrated in Fig. 2. It first performs three preprocessing steps that generate a Boolean abstraction of the input formula, reduce the size of the input alphabet, and initialize bounds on the lengths of all string variables. After preprocessing, we enter an encode-solve-and-refine loop that iteratively queries a SAT solver with a problem encoding based on the current bounds and refines the bounds after each unsatisfiable solver call. We repeat this loop until either the propositional encoding is satisfiable, in which case we conclude satisfiability of the input formula, or each bound has reached a theoretical upper bound, in which case we conclude unsatisfiability.

Fig. 2. Overview of the solving process.

Fig. 3. Example of Boolean abstraction. The formula \(\psi \), whose expression tree is shown on the left, results in the Boolean abstraction illustrated on the right, where p, q, and r are fresh Boolean variables. We additionally get the definitions \(p \rightarrow \textsf{x} \overset{.}{\in }R_1\), \(q \leftrightarrow \textsf{y} \overset{.}{\in }R_2\), and \(r \leftrightarrow \textsf{z} \doteq w\). We use an implication (instead of an equivalence) for atom \(\textsf{x} \overset{.}{\in }R_1\) since it occurs only with positive polarity within \(\psi \).

Generating the Boolean Abstraction. We abstract the input formula \(\psi \) by replacing each theory atom \(a\in {{\,\mathrm{\textit{atoms}}\,}}(\psi )\) with a new Boolean variable \({{\,\mathrm{\textbf{d}}\,}}(a)\), and keep track of the mapping between a and \({{\,\mathrm{\textbf{d}}\,}}(a)\). This gives us a Boolean abstraction \(\psi _\mathcal {A}\) of \(\psi \) and a set \(\textbf{D}\) of definitions, where each definition expresses the relationship between an atom a and its corresponding Boolean variable \({{\,\mathrm{\textbf{d}}\,}}(a)\). If a occurs with only one polarity in \(\psi \), we encode the corresponding definition as an implication, i.e., as \({{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow a\) or as \(\lnot {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow \lnot a\), depending on the polarity of a. Otherwise, if a occurs with both polarities, we encode it as an equivalence consisting of both implications. This encoding, which is based on ideas behind the well-known Plaisted-Greenbaum transformation [28], ensures that the formulas \(\psi \) and \(\psi _\mathcal {A} \wedge \bigwedge _{d\in \textbf{D}} d\) are equisatisfiable. An example is shown in Fig. 3.
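The polarity-aware abstraction can be sketched in a few lines. The following is an illustrative implementation under assumed data structures (formulas as nested tuples over `'and'`, `'or'`, `'not'`, and `('atom', name)`; the function names are ours, not the tool's): it collects each atom's polarities and records whether its definition should be a single implication or a full equivalence.

```python
# Sketch (assumed formula representation) of the polarity-aware Boolean
# abstraction. Each atom is replaced by a fresh variable p0, p1, ...; the
# definition kind depends on the polarities with which the atom occurs.

def polarities(f, pol=1, out=None):
    """Collect the polarities (+1 / -1) of every atom in formula f."""
    if out is None:
        out = {}
    if f[0] == 'atom':
        out.setdefault(f[1], set()).add(pol)
    elif f[0] == 'not':
        polarities(f[1], -pol, out)
    else:  # 'and' / 'or' node: polarity is preserved in all children
        for sub in f[1:]:
            polarities(sub, pol, out)
    return out

def abstract(f):
    """Return (abstraction, defs); defs maps each fresh variable to
    'impl_pos' (d -> a), 'impl_neg' (-d -> -a), or 'iff' (d <-> a)."""
    pols = polarities(f)
    names = {a: f'p{i}' for i, a in enumerate(sorted(pols))}
    defs = {names[a]: ('iff' if ps == {1, -1}
                       else 'impl_pos' if ps == {1} else 'impl_neg')
            for a, ps in pols.items()}

    def rebuild(g):
        if g[0] == 'atom':
            return ('var', names[g[1]])
        if g[0] == 'not':
            return ('not', rebuild(g[1]))
        return (g[0],) + tuple(rebuild(s) for s in g[1:])

    return rebuild(f), defs
```

On a formula where one atom occurs only positively and another occurs with both polarities, the first atom gets an implication and the second an equivalence, matching Fig. 3.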

Reducing the Alphabet. In the SMT-LIB theory of strings [4], the alphabet \(\varSigma \) comprises \(3 \cdot 2^{16}\) letters, but we can typically use a much smaller alphabet without affecting satisfiability. In Sect. 4, we show that using \(\varSigma (\psi )\) and one extra character per string variable is sufficient. Reducing the alphabet is critical for our SAT encoding to be practical.

Initializing Bounds. A model for the original first-order formula \(\psi \) is a substitution \(h:\varGamma \rightarrow \varSigma ^*\) that maps each string variable to a word of arbitrary length such that \(\psi \) evaluates to true. As we use a SAT solver to find such substitutions, we need to bound the lengths of strings, which we do by defining a bound function \(\textrm{b}: \varGamma \rightarrow \mathbb {N}\) that assigns an upper bound to each string variable. We initialize a small upper bound for each variable, relying on simple heuristics. If the bounds are too small, we increase them in a later refinement step.

Encoding, Solving, and Refining Bounds. Given a bound function \(\textrm{b}\), we build a propositional formula \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\) that is satisfiable if and only if the original formula \(\psi \) has a solution h such that \(|h(\textsf{x}) | \le \textrm{b}(\textsf{x})\) for all \(\textsf{x} \in \varGamma \). We encode \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\) as the conjunction \(\psi _\mathcal {A} \wedge {\llbracket \textbf{D}\rrbracket }^{\textrm{b}_{}} \wedge {\llbracket h\rrbracket }^{\textrm{b}_{}}\), where \(\psi _\mathcal {A}\) is the Boolean abstraction of \(\psi \), \({\llbracket \textbf{D}\rrbracket }^{\textrm{b}_{}}\) is an encoding of the definitions \(\textbf{D}\), and \({\llbracket h\rrbracket }^{\textrm{b}_{}}\) is an encoding of the set of possible substitutions. We discuss details of the encoding in Sect. 5. A key property is that it relies on incremental SAT solving under assumptions [13]. Increasing bounds amounts to adding new clauses to the formula \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\) and fixing a set of assumptions, i.e., temporarily fixing the truth values of a set of Boolean variables. If \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\) is satisfiable, we can construct a substitution h from a Boolean model \(\omega \) of \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\). Otherwise, we examine an unsatisfiable core (i.e., an unsatisfiable subformula) of \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\) to determine whether increasing the bounds may give a solution and, if so, to identify the variables whose bounds must be increased. In Sect. 6, we explain in detail how we analyze unsatisfiable cores, increase bounds, and conclude unsatisfiability.
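The overall control loop can be sketched schematically. In the sketch below the incremental SAT call and unsatisfiable-core extraction are mocked by a toy `solve` function (here, the instance is satisfiable exactly when the bound on `'x'` reaches 3); all names (`UPPER`, `solve`, `solve_formula`) are illustrative and not the tool's API.

```python
# Schematic sketch of the encode-solve-and-refine loop with a mocked
# incremental SAT call. Bounds grow only for variables in the unsat core,
# and unsatisfiability is concluded once every core variable has reached
# its theoretical upper bound.

UPPER = {'x': 10}           # theoretical per-variable upper bounds

def solve(bounds):
    """Mock SAT call on [[psi]]^b: returns (sat?, unsat core of variables)."""
    if bounds['x'] >= 3:    # toy constraint: x must have length at least 3
        return True, set()
    return False, {'x'}     # core: only x's bound blocks a solution

def solve_formula():
    bounds = {'x': 1}                     # small initial bound (heuristic)
    while True:
        sat, core = solve(bounds)
        if sat:
            return 'sat', bounds
        blocked = {v for v in core if bounds[v] < UPPER[v]}
        if not blocked:
            return 'unsat', bounds        # all core bounds are maximal
        for v in blocked:                 # refine: grow only core bounds
            bounds[v] = min(2 * bounds[v], UPPER[v])
```

Starting from bound 1, the loop doubles the bound of `'x'` twice (1, 2, 4) before the mocked call reports satisfiability.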

4 Reducing the Alphabet

In many applications, the alphabet \(\varSigma \) is large—typically Unicode or an approximation of Unicode as defined in the SMT-LIB standard—but formulas use far fewer symbols (fewer than 100 symbols is common in our experiments). In order to check the satisfiability of a formula \(\psi \), we can restrict the alphabet to the symbols that occur in \(\psi \) and add one extra character per variable. This allows us to produce compact propositional encodings that can be solved efficiently in practice.

To prove that such a reduced alphabet A is sufficient, we show that a model \(h :\varGamma \rightarrow \varSigma ^*\) of \(\psi \) can be transformed into a model \(h' :\varGamma \rightarrow A^*\) of \(\psi \) by replacing characters of \(\varSigma \) that do not occur in \(\psi \) by new symbols—one new symbol per variable of \(\psi \). For example, suppose \(V(\psi ) = \{\textsf{x}_1, \textsf{x}_2\}\), \(\varSigma (\psi ) = \{\text {a}, \text {c}, \text {d}\}\), and h is a model of \(\psi \) such that \(h(\textsf{x}_1) = \text {abcdef}\) and \(h(\textsf{x}_2) = \text {abbd}\). We introduce two new symbols \(\alpha _1, \alpha _2 \in \varSigma \setminus \varSigma (\psi )\), define \(h'(\textsf{x}_1) = \text {a}\alpha _1\text {cd}\alpha _1\alpha _1\) and \(h'(\textsf{x}_2) = \text {a}\alpha _2\alpha _2\text {d}\), and argue that \(h'\) is a model as well.

More generally, assume B is a subset of \(\varSigma \) and n is a positive integer such that \(|B| \le |\varSigma | - n\). We can then pick n distinct symbols \(\alpha _1,\ldots ,\alpha _n\) from \(\varSigma \setminus B\). Let A be the set \(B \cup \{ \alpha _1,\ldots ,\alpha _n\}\). We construct n functions \(f_1,\ldots ,f_n\) from \(\varSigma \) to A by setting \(f_i(a) = a\) if \(a\in B\), and \(f_i(a) = \alpha _i\) otherwise. We extend \(f_i\) to words of \(\varSigma ^*\) in the natural way: \(f_i(\varepsilon ) = \varepsilon \) and \(f_i(a \cdot w) = f_i(a) \cdot f_i(w)\). This construction satisfies the following property:

Lemma 4.1

Let \(f_1, \dots , f_n\) be mappings as defined above, and let \(i,j \in 1, \dots , n\) such that \(i \ne j\). Then, the following holds:

  1. If a and b are distinct symbols of \(\varSigma \), then \(f_i(a) \ne f_j(b)\).

  2. If w and \(w'\) are distinct words of \(\varSigma ^*\), then \(f_i(w) \ne f_j(w')\).

Proof

The first part is an easy case analysis. For the second part, we have that \(|f_i(w)| = |w|\) and \(|f_j(w')| = |w'|\), so the statement holds if w and \(w'\) have different lengths. Assume now that w and \(w'\) have the same length and let v be the longest common prefix of w and \(w'\). Since w and \(w'\) are distinct, we have that \(w = v \cdot a \cdot u\) and \(w'=v \cdot b \cdot u'\), where \(a\ne b\) are symbols of \(\varSigma \) and u and \(u'\) are words of \(\varSigma ^*\). By the first part, we have \(f_i(a) \ne f_j(b)\), so \(f_i(w)\) and \(f_j(w')\) must be distinct.    \(\square \)

The following lemma can be proved by induction on R.

Lemma 4.2

Let \(f_1, \dots , f_n\) be mappings as defined above and let R be a regular expression with \(\varSigma (R) \subseteq B\). Then, for all words \(w \in \varSigma ^*\) and all \(i \in 1, \dots , n\), \(w\in \mathcal {L}(R)\) if and only if \(f_i(w)\in \mathcal {L}(R)\).
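The mappings \(f_i\) are straightforward to implement: characters of B are kept, and every other character collapses to the i-th fresh symbol. The sketch below (our own illustration) replays the running example with \(B = \{\text {a}, \text {c}, \text {d}\}\).

```python
# Direct implementation of the mappings f_i from the alphabet-reduction
# construction: keep characters in B, collapse everything else to alpha_i.

def make_f(B, alpha_i):
    """Return f_i : Sigma* -> A* for the reduced alphabet A = B u {alpha_i}."""
    return lambda w: ''.join(c if c in B else alpha_i for c in w)

B = {'a', 'c', 'd'}            # Sigma(psi) from the running example
f1 = make_f(B, '\u03b1')       # alpha_1: fresh symbol for variable x_1
f2 = make_f(B, '\u03b2')       # alpha_2: fresh symbol for variable x_2
```

Applying `f1` to abcdef yields \(\text {a}\alpha _1\text {cd}\alpha _1\alpha _1\) and `f2` maps abbd to \(\text {a}\alpha _2\alpha _2\text {d}\), exactly the transformed model \(h'\) of the example above; because the fresh symbols are distinct, distinct inputs under distinct \(f_i, f_j\) stay distinct (Lemma 4.1).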

Given a subset A of \(\varSigma \), we say that \(\psi \) is satisfiable in A if there is a model \(h :V(\psi ) \rightarrow A^*\) of \(\psi \). We can now prove the main theorem of this section, which shows how to reduce the alphabet while maintaining satisfiability.

Theorem 4.3

Let \(\psi \) be a formula with at most n string variables \(\textsf{x}_1,\ldots ,\textsf{x}_n\) such that \(|\varSigma (\psi ) |+n \le |\varSigma |\). Then, \(\psi \) is satisfiable if and only if it is satisfiable in an alphabet \(A\subseteq \varSigma \) of cardinality \(|A |=|\varSigma (\psi ) | + n\).

Proof

We set \(B = \varSigma (\psi )\) and use the previous construction. So the alphabet \(A = B \cup \{\alpha _1,\ldots ,\alpha _n\}\) has cardinality \(|\varSigma (\psi ) | + n\), where \(\alpha _1,\ldots \alpha _n\) are distinct symbols of \(\varSigma \setminus B\). We can assume that \(\psi \) is in disjunctive normal form, meaning that it is a disjunction of the form \(\psi = \psi _1 \vee \dots \vee \psi _m\), where each \(\psi _t\) is a conjunctive formula. If \(\psi \) is satisfiable, then one of the disjuncts \(\psi _k\) is satisfiable and we have \(\varSigma (\psi _k) \subseteq B\). We can turn \(\psi _k\) into normal form by eliminating all variable equalities of the form \(\textsf{x}_i \doteq \textsf{x}_j\) from \(\psi _k\), resulting in a conjunction \(\varphi _k\) of literals of the form \(\textsf{x}_i\overset{.}{\in }R\), \(\textsf{x}_i\overset{.}{\notin }R\), or \(\textsf{x}_i \not \doteq \textsf{x}_j\). Clearly, for any \(A \subseteq \varSigma \), \(\varphi _k\) is satisfiable in A if and only if \(\psi _k\) is satisfiable in A.

Let \(h :V(\varphi _k) \rightarrow \varSigma ^*\) be a model of \(\varphi _k\) and define the mapping \(h' :V(\varphi _k) \rightarrow A^*\) as \(h'(\textsf{x}_i) = f_i(h(\textsf{x}_i))\). We show that \(h'\) is a model of \(\varphi _k\). Consider a literal l of \(\varphi _k\). We have three cases:

  • l is of the form \(\textsf{x}_i \overset{.}{\in }R\) where \(\varSigma (R) \subseteq \varSigma (\psi ) = B\). Since h satisfies \(\varphi _k\), we must have \(h(\textsf{x}_i)\in \mathcal {L}(R)\) so \(h'(\textsf{x}_i) = f_i(h(\textsf{x}_i))\) is also in \(\mathcal {L}(R)\) by Lemma 4.2.

  • l is of the form \(\textsf{x}_i \overset{.}{\notin }R\) with \(\varSigma (R)\subseteq B\). Then, \(h(\textsf{x}_i) \notin \mathcal {L}(R)\), and we can conclude again by Lemma 4.2.

  • l is of the form \(\textsf{x}_i \not \doteq \textsf{x}_j\). Since h satisfies \(\varphi _k\), we must have \(i\ne j\) and \(h(\textsf{x}_i) \ne h(\textsf{x}_j)\), which implies \(h'(\textsf{x}_i) = f_i(h(\textsf{x}_i)) \ne f_j(h(\textsf{x}_j)) = h'(\textsf{x}_j)\) by Lemma 4.1.

All literals of \(\varphi _k\) are then satisfied by \(h'\), hence \(\varphi _k\) is satisfiable in A and thus so is \(\psi _k\). It follows that \(\psi \) is satisfiable in A.    \(\square \)

The reduction presented here can be improved and generalized. For example, it can be worthwhile to use different alphabets for different variables or to reduce large character intervals to smaller sets.

5 Propositional Encodings

Our algorithm performs a series of calls to a SAT solver. Each call determines the satisfiability of the propositional encoding \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\) of \(\psi \) for some upper bounds \(\textrm{b}\). Recall that \({\llbracket \psi \rrbracket }^{\textrm{b}_{}} = \psi _\mathcal {A} \wedge {\llbracket h\rrbracket }^{\textrm{b}_{}} \wedge {\llbracket \textbf{D}\rrbracket }^{\textrm{b}_{}}\), where \(\psi _\mathcal {A}\) is the Boolean abstraction of \(\psi \), \({\llbracket h\rrbracket }^{\textrm{b}_{}}\) is an encoding of the set of possible substitutions, and \({\llbracket \textbf{D}\rrbracket }^{\textrm{b}_{}}\) is an encoding of the theory-literal definitions, both bounded by b. Intuitively, \({\llbracket h\rrbracket }^{\textrm{b}_{}}\) tells the SAT solver to “guess” a substitution, \({\llbracket \textbf{D}\rrbracket }^{\textrm{b}_{}}\) makes sure that all theory literals are assigned proper truth values according to the substitution, and \(\psi _\mathcal {A}\) forces the evaluation of the whole formula under these truth values.

Suppose the algorithm performs n calls and let \(\textrm{b}_k: \varGamma \rightarrow \mathbb {N}\) for \(k\in 1, \dots , n\) denote the upper bounds used in the k-th call to the SAT solver. For convenience, we additionally define \(\textrm{b}_{0}(\textsf{x}) = 0\) for all \(\textsf{x}\in \varGamma \). In the k-th call, the SAT solver decides whether \({\llbracket \psi \rrbracket }^{\textrm{b}_{k}}\) is satisfiable. The Boolean abstraction \(\psi _\mathcal {A}\), which we already discussed in Sect. 3, stays the same for each call. In the following, we thus discuss the encodings of the substitutions \({\llbracket h\rrbracket }^{\textrm{b}_{k}}\) and of the various theory literals \({\llbracket a\rrbracket }^{\textrm{b}_{k}}\) and \({\llbracket \lnot a\rrbracket }^{\textrm{b}_{k}}\) that are part of \({\llbracket \textbf{D}\rrbracket }^{\textrm{b}_{k}}\). Even though SAT solvers expect their input in CNF, we do not present the encodings in CNF to simplify the presentation, but they can be converted to CNF using simple equivalence transformations.

Most of our encodings are incremental in the sense that the formula for call k is constructed by only adding clauses to the formula for call \(k-1\). In other words, for substitution encodings we have \({\llbracket h\rrbracket }^{\textrm{b}_{k}} = {\llbracket h\rrbracket }^{\textrm{b}_{k-1}} \wedge \llbracket h\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}}\) and for literals we have \({\llbracket l\rrbracket }^{\textrm{b}_{k}} = {\llbracket l\rrbracket }^{\textrm{b}_{k-1}} \wedge {\llbracket l\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}}}\), with the base case \({\llbracket h\rrbracket }^{\textrm{b}_{0}} = {\llbracket l\rrbracket }^{\textrm{b}_{0}} = \top \). In these cases, it is thus enough to encode the incremental additions \({\llbracket l\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}}}\) and \({\llbracket h\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}}}\) for each call to the SAT solver. Some of our encodings, however, introduce clauses that are valid only for a specific bound \(\textrm{b}_k\) and thus become invalid for larger bounds. We handle the deactivation of these encodings with selector variables as is common in incremental SAT solving.

Our encodings are correct in the following sense.

Theorem 5.1

Let l be a literal and let \(\textrm{b}: \varGamma \rightarrow \mathbb {N}\) be a bound function. Then, l has a model that is bounded by \(\textrm{b}\) if and only if \({\llbracket h\rrbracket }^{\textrm{b}_{}} \wedge {\llbracket l\rrbracket }^{\textrm{b}_{}}\) is satisfiable.

5.1 Substitutions

We encode substitutions by defining for each variable \(\textsf{x} \in \varGamma \) the characters to which each of \(\textsf{x}\)’s positions is mapped. Specifically, given \(\textsf{x}\) and its corresponding upper bound \(\textrm{b}(\textsf{x})\), we represent the substitution \(h(\textsf{x})\) by introducing new variables \(\textsf{x}[1], \dots , \textsf{x}[{\textrm{b}(\textsf{x})}]\), one for each symbol \(h(\textsf{x})[i]\) of the word \(h(\textsf{x})\). We call these variables filler variables and we denote the set of all filler variables by \(\check{\varGamma }\). By introducing a new symbol \(\lambda \), which stands for an unused filler variable, we can define h based on a substitution \(\check{h} :\check{\varGamma } \rightarrow \varSigma _\lambda \) over the filler variables, where \(\varSigma _\lambda = \varSigma \cup \{\lambda \}\):

$$\begin{aligned} h(\textsf{x})[i] = {\left\{ \begin{array}{ll} \varepsilon &{} \text {if } \check{h}(\textsf{x}[i]) = \lambda \\ \check{h}(\textsf{x}[i]) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

We use this representation of substitutions (known as “filling the positions” [18]) because it has a straightforward propositional encoding: For each variable \(\textsf{x}\in \varGamma \) and each position \(i \in 1, \dots , \textrm{b}(\textsf{x})\), we create a set \(\{h_{\textsf{x}[i]}^{a} \mid a \in \varSigma _\lambda \}\) of Boolean variables, where \(h_{\textsf{x}[i]}^{a}\) is true if \(\check{h}(\textsf{x}[i]) = a\). We then use a propositional encoding of an exactly-one (EO) constraint (e.g., [20]) to assert that exactly one variable in this set must be true:

$$\begin{aligned} \llbracket h\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} = \bigwedge _{\textsf{x} \in \varGamma }~\bigwedge _{i=\textrm{b}_{k-1}(\textsf{x})+1}^{\textrm{b}_{k}(\textsf{x})}\text {EO}(\{h_{\textsf{x}[i]}^{a} \mid a \in \varSigma _\lambda \}) \end{aligned}$$
(1)
$$\begin{aligned} \wedge ~ \bigwedge _{\textsf{x} \in \varGamma }~\bigwedge _{i=\textrm{b}_{k-1}(\textsf{x})}^{\textrm{b}_{k}(\textsf{x})-1} h_{\textsf{x}[i]}^{\lambda } \rightarrow h_{\textsf{x}[i+1]}^{\lambda } \end{aligned}$$
(2)

Constraint (2) prevents the SAT solver from considering filled substitutions that are equivalent modulo \(\lambda \)-substitutions—it enforces that if a position i is mapped to \(\lambda \), all following positions are mapped to \(\lambda \) too. For instance, \(ab \lambda \lambda \), \(a \lambda b \lambda \), and \(\lambda \lambda a b\) all correspond to the same word ab, but our encoding allows only \(ab\lambda \lambda \). Thus, every Boolean assignment \(\omega \) that satisfies \(\llbracket h\rrbracket ^{\textrm{b}}\) encodes exactly one substitution \(h_\omega \), and for every substitution h (bounded by \(\textrm{b}\)) there exists a corresponding assignment \(\omega _h\) that satisfies \(\llbracket h\rrbracket ^{\textrm{b}}\).
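Constraints (1) and (2) have a direct clause-level rendering. The sketch below is illustrative (the variable numbering and the pairwise at-most-one encoding are our own assumptions; the tool may use a different EO encoding [20]): for each position it emits an at-least-one clause and pairwise at-most-one clauses, then adds the \(\lambda \)-propagation clauses of constraint (2).

```python
# Clause generator (DIMACS-style integers) for the substitution encoding of
# a single variable x: exactly-one character per position (here via the
# pairwise at-most-one encoding) plus lambda-suffix propagation.
from itertools import combinations

LAMBDA = '\u03bb'

def encode_substitution(bound, sigma):
    sigma_l = list(sigma) + [LAMBDA]
    var = {}                                  # (position, char) -> DIMACS id
    def vid(i, a):
        return var.setdefault((i, a), len(var) + 1)

    clauses = []
    for i in range(1, bound + 1):
        lits = [vid(i, a) for a in sigma_l]
        clauses.append(lits)                  # (1): at least one character
        clauses.extend([-p, -q]               # (1): pairwise at most one
                       for p, q in combinations(lits, 2))
    for i in range(1, bound):                 # (2): lambda propagates right
        clauses.append([-vid(i, LAMBDA), vid(i + 1, LAMBDA)])
    return clauses, var
```

With bound 3 and \(\varSigma = \{\text {a}, \text {b}\}\) this yields four clauses per position plus two \(\lambda \)-propagation clauses; only \(\lambda \)-suffix-padded assignments such as \(ab\lambda \lambda \) satisfy the result.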

5.2 Theory Literals

The only theory literals of our core language are regular constraints (\(\textsf{x} \overset{.}{\in }R\)) and variable equations (\(\textsf{x} \doteq \textsf{y}\)) with their negations. Constant equations (\(\textsf{x} \doteq w\)) as well as prefix and suffix constraints (\(w \overset{.}{\sqsubseteq }\textsf{x}\) and \(w \overset{.}{\sqsupseteq }\textsf{x}\)) could be expressed as regular constraints, but we encode them explicitly to improve performance.

Regular Constraints. We encode a regular constraint \(\textsf{x} \overset{.}{\in }R\) by constructing a propositional formula that is true if and only if the word \(h(\textsf{x})\) is accepted by a specific nondeterministic finite automaton that accepts the language \(\mathcal {L}(R)\). Let \(\textsf{x} \overset{.}{\in }R\) be a regular constraint and let \(M = (Q, \varSigma , \delta , q_0, F)\) be a nondeterministic finite automaton (with states Q, alphabet \(\varSigma \), transition relation \(\delta \), initial state \(q_0\), and accepting states F) that accepts \(\mathcal {L}(R)\) and that additionally allows \(\lambda \)-self-transitions on every state. Given that \(\lambda \) is a placeholder for the empty symbol, \(\lambda \)-transitions do not change the language accepted by M. We allow them so that M performs exactly \(\textrm{b}_{}(\textsf{x})\) transitions, even for substitutions of length less than \(\textrm{b}_{}(\textsf{x})\). This reduces checking whether the automaton accepts a word to only evaluating the states reached after exactly \(\textrm{b}_{}(\textsf{x})\) transitions.

Given a model \(\omega \models {\llbracket h\rrbracket }^{\textrm{b}_{}}\), we express the semantics of M in propositional logic by encoding which states are reachable after reading \(h_\omega (\textsf{x})\). To this end, we assign \(\textrm{b}(\textsf{x})+1\) Boolean variables \(\{S_q^0, S_q^1, \dots , S_q^{\textrm{b}(\textsf{x})}\}\) to each state \(q \in Q\) and assert that \(\omega (S_q^i) = 1\) if and only if q can be reached by reading the prefix \(h_\omega (\textsf{x})[1..i]\). We encode this as a conjunction \(\llbracket (M;\textsf{x})\rrbracket = \llbracket \text {I}_{(M;\textsf{x})}\rrbracket \wedge \llbracket \text {T}_{(M;\textsf{x})}\rrbracket \wedge \llbracket \text {P}_{(M;\textsf{x})}\rrbracket \) of three formulas, modelling the semantics of the initial state, the transition relation, and the predecessor relation of M. We assert that the initial state \(q_0\) is the only state reachable after reading the prefix of length 0, i.e., \(\llbracket \text {I}_{(M;\textsf{x})}\rrbracket ^{\textrm{b}_{1}} = S_{q_0}^0 \wedge \bigwedge _{q \in Q\setminus \left\{ q_0\right\} } \lnot S_{q}^0\). This condition is independent of the bound on \(\textsf{x}\), so we set \(\llbracket \text {I}_{(M;\textsf{x})}\rrbracket ^{\textrm{b}_{k}}_{\textrm{b}_{k-1}} = \top \) for all \(k>1\).

We encode the transition relation of M by stating that if M is in some state q after reading \(h_\omega (\textsf{x})[1..i]\), and if there exists a transition from q to \(q'\) labelled with an a, then M can reach state \(q'\) after \(i+1\) transitions if \(h_\omega (\textsf{x})[i+1] = a\). This is expressed in the following formula:

$$\begin{aligned} \llbracket \text {T}_{(M;\textsf{x})}\rrbracket ^{\textrm{b}_{k}}_{\textrm{b}_{k-1}} = \bigwedge _{i=\textrm{b}_{k-1}(\textsf{x})}^{\textrm{b}_{k}(\textsf{x})-1}\,\bigwedge _{(q,a) \in {{\,\textrm{dom}\,}}(\delta )}\,\bigwedge _{q' \in \delta (q, a)} (S_q^i \wedge h_{\textsf{x}[i+1]}^{a}) \rightarrow S_{q'}^{i+1} \end{aligned}$$

The formula captures all possible forward moves from each state. We must also ensure that a state is reachable only if it has a reachable predecessor, which we encode with the following formula, where \({{\,\textrm{pred}\,}}(q') = \{(q, a) \mid q' \in \delta (q, a)\}\):

$$\begin{aligned} \llbracket \text {P}_{(M;\textsf{x})}\rrbracket ^{\textrm{b}_{k}}_{\textrm{b}_{k-1}} = \bigwedge _{i=\textrm{b}_{k-1}(\textsf{x})+1}^{\textrm{b}_{k}(\textsf{x})}\,\bigwedge _{q'\in Q} (S_{q'}^i \rightarrow \bigvee _{(q,a)\in {{\,\textrm{pred}\,}}(q')} ( S_{q}^{i-1} \wedge h_{\textsf{x}[i]}^{a} )) \end{aligned}$$

The formula states that if state \(q'\) is reachable after \(i\ge 1\) transitions, then there must be a reachable predecessor state \(q \in \hat{\delta }(\{q_0\}, h_\omega (\textsf{x})[1..i-1])\) such that \(q' \in \delta (q, h_\omega (\textsf{x})[i])\).

To decide whether the automaton accepts \(h_\omega (\textsf{x})\), we encode that it must reach an accepting state after \(\textrm{b}_{k}(\textsf{x})\) transitions. The corresponding encoding is only valid for the particular bound \(\textrm{b}_{k}(\textsf{x})\). To account for this, we introduce a fresh selector variable \(s_k\) and define \(\llbracket \text {accept}_{\textsf{x}\overset{.}{\in }M}\rrbracket ^{\textrm{b}_k}_{\textrm{b}_{k-1}} = s_k \rightarrow \bigvee _{q_f \in F} S_{q_f}^{\textrm{b}_{k}(\textsf{x})}\). Analogously, we define \(\llbracket \text {reject}_{\textsf{x}\overset{.}{\in }M}\rrbracket ^{\textrm{b}_k}_{\textrm{b}_{k-1}} = s_k \rightarrow \bigwedge _{q_f \in F} \lnot S_{q_f}^{\textrm{b}_{k}(\textsf{x})}\). In the k-th call to the SAT solver and all following calls with the same bound on \(\textsf{x}\), we solve under the assumption that \(s_k\) is true. In the first call \(k'\) with \(\textrm{b}_{k}(\textsf{x}) < \textrm{b}_{k'}(\textsf{x})\), we re-encode the condition using a new selector variable \(s_{k'}\) and solve under the assumption that \(s_k\) is false and \(s_{k'}\) is true. The full encoding of the regular constraint \(\textsf{x} \overset{.}{\in }R\) is thus given by

$$\llbracket \textsf{x} \overset{.}{\in }R\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} = \llbracket (M;\textsf{x})\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} \wedge \llbracket \text {accept}_{\textsf{x}\overset{.}{\in }M}\rrbracket ^{\textrm{b}_k}_{\textrm{b}_{k-1}}$$

and its negation is encoded as

$$\llbracket \lnot (\textsf{x} \overset{.}{\in }R)\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} = \llbracket (M;\textsf{x})\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} \wedge \llbracket \text {reject}_{\textsf{x}\overset{.}{\in }M}\rrbracket ^{\textrm{b}_k}_{\textrm{b}_{k-1}}$$

Variable Equations. Let \(\textsf{x}, \textsf{y} \in \varGamma \) be two string variables, let \(l = \min (\textrm{b}_{k-1}(\textsf{x}), \textrm{b}_{k-1}(\textsf{y}))\), and let \(u = \min (\textrm{b}_{k}(\textsf{x}), \textrm{b}_{k}(\textsf{y}))\). We encode equality between \(\textsf{x}\) and \(\textsf{y}\) with respect to \(\textrm{b}_{k}\) position-wise up to u:

$$\begin{aligned} \llbracket \textsf{x} \doteq \textsf{y}\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} = \bigwedge _{i=l+1}^{u} \bigwedge _{a\in \varSigma _\lambda } (h_{\textsf{x}[i]}^{a} \rightarrow h_{\textsf{y}[i]}^{a}). \end{aligned}$$

The formula asserts that for each position \(i\in l+1, \dots , u\), if \(\textsf{x}[i]\) is mapped to a symbol, then \(\textsf{y}[i]\) is mapped to the same symbol (including \(\lambda \)). Since our encoding of substitutions ensures that every position in a string variable is mapped to exactly one character, \(\llbracket \textsf{x} \doteq \textsf{y}\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}}\) ensures \(\textsf{x}[i] = \textsf{y}[i]\) for \(i \in l+1, \dots , u\). In conjunction with \({\llbracket \textsf{x} \doteq \textsf{y}\rrbracket }^{\textrm{b}_{k-1}}\), which encodes equality up to the l-th position, we have symbol-wise equality of \(\textsf{x}\) and \(\textsf{y}\) up to bound u. Thus, if \(\textrm{b}_{k}(\textsf{x}) = \textrm{b}_{k}(\textsf{y})\), then the formula ensures the equality of both variables. If \(\textrm{b}_{k}(\textsf{x}) > \textrm{b}_{k}(\textsf{y})\), we add \(h_{\textsf{x}[u+1]}^{\lambda }\) as an assumption to the solver to ensure \(\textsf{x}[i] = \lambda \) for \(i \in u+1,\dots ,\textrm{b}_{k}(\textsf{x})\) and, symmetrically, we add the assumption \(h_{\textsf{y}[u+1]}^{\lambda }\) if \(\textrm{b}_{k}(\textsf{y}) > \textrm{b}_{k}(\textsf{x})\).
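
Generating the corresponding CNF is direct: each implication \(h_{\textsf{x}[i]}^{a} \rightarrow h_{\textsf{y}[i]}^{a}\) becomes a binary clause. The Python sketch below is illustrative only (the triple-based variable representation is hypothetical, not the nfa2sat implementation):

```python
def encode_var_equality(x, y, l, u, alphabet):
    """Clauses h_x[i]^a -> h_y[i]^a for positions l+1..u (with λ in alphabet).

    A variable is a (name, position, symbol) triple; a clause is a list of
    (variable, polarity) pairs, so [¬h_x[i]^a, h_y[i]^a] is one clause.
    """
    clauses = []
    for i in range(l + 1, u + 1):
        for a in alphabet:
            clauses.append([((x, i, a), False), ((y, i, a), True)])
    return clauses

def satisfied(clauses, true_vars):
    """A clause is satisfied if some literal has the right truth value."""
    return all(any((v in true_vars) == pol for v, pol in clause)
               for clause in clauses)
```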

For the negation \(\lnot (\textsf{x} \doteq \textsf{y})\), we encode that \(h(\textsf{x})\) and \(h(\textsf{y})\) must disagree on at least one position, which can happen either because they map to different symbols or because the variable with the higher bound is mapped to a longer word. As for the regular constraints, we again use a selector variable \(s_k\) to deactivate the encoding for all later bounds, for which it will be re-encoded:

$$\llbracket \lnot (\textsf{x} \doteq \textsf{y})\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} = s_k \rightarrow \Big ( \bigvee _{i=1}^{u}\, \bigvee _{a\in \varSigma _\lambda } (h_{\textsf{x}[i]}^{a} \wedge \lnot h_{\textsf{y}[i]}^{a}) \vee \bigvee _{i=u+1}^{\textrm{b}_{k}(\textsf{x})} \lnot h_{\textsf{x}[i]}^{\lambda } \vee \bigvee _{i=u+1}^{\textrm{b}_{k}(\textsf{y})} \lnot h_{\textsf{y}[i]}^{\lambda } \Big )$$

Constant Equations. Given a constant equation \(\textsf{x} \doteq w\), if the upper bound of \(\textsf{x}\) is less than \(|w |\), the atom is trivially unsatisfiable. Thus, for all k such that \(\textrm{b}_{k}(\textsf{x}) < |w |\), we encode \(\textsf{x}\doteq w\) with a single literal \(\lnot s_{\textsf{x}, w}\) and add \(s_{\textsf{x}, w}\) to the assumptions. For \(\textrm{b}_{k}(\textsf{x}) \ge |w |\), the encoding is based on the value of \(\textrm{b}_{k-1}(\textsf{x})\):

$$\begin{aligned} \llbracket \textsf{x} \doteq w\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} = {\left\{ \begin{array}{ll} \bigwedge \nolimits _{i=1}^{|w |}h_{\textsf{x}[i]}^{w[i]} &{} \text {if } \textrm{b}_{k-1}(\textsf{x})< |w | = \textrm{b}_{k}(\textsf{x}) \\ \bigwedge \nolimits _{i=1}^{|w |}h_{\textsf{x}[i]}^{w[i]} \wedge h_{\textsf{x}[|w |+1]}^{\lambda } &{} \text {if } \textrm{b}_{k-1}(\textsf{x})< |w |< \textrm{b}_{k}(\textsf{x}) \\ h_{\textsf{x}[|w |+1]}^{\lambda } &{} \text {if } \textrm{b}_{k-1}(\textsf{x}) = |w |< \textrm{b}_{k}(\textsf{x}) \\ \top &{} \text {if } |w | < \textrm{b}_{k-1}(\textsf{x}) \end{array}\right. } \end{aligned}$$

If \(\textrm{b}_{k-1}(\textsf{x})< |w |\), then equality is encoded for all positions \(1, \dots , |w |\). Additionally, if \(\textrm{b}_{k}(\textsf{x}) > |w |\), we ensure that the suffix of \(\textsf{x}\) is empty starting from position \(|w |+1\). If \(\textrm{b}_{k-1}(\textsf{x}) = |w | < \textrm{b}_{k}(\textsf{x})\), then only the empty suffix has to be ensured. Lastly, if \(|w | < \textrm{b}_{k-1}(\textsf{x})\), then \({\llbracket \textsf{x} \doteq w\rrbracket }^{\textrm{b}_{k-1}} \Leftrightarrow {\llbracket \textsf{x} \doteq w\rrbracket }^{\textrm{b}_{k}}\).
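
The case distinction translates into a small amount of code. The sketch below is illustrative Python (unit assertions stand in for the conjuncts of the definition); it returns the new assertions added when moving from bound \(\textrm{b}_{k-1}\) to \(\textrm{b}_{k}\), assuming \(\textrm{b}_{k}(\textsf{x}) \ge |w |\):

```python
LAMBDA = 'λ'  # padding symbol

def encode_const_equation(x, w, b_prev, b_cur):
    """Incremental encoding of x ≐ w (sketch), as unit assertions
    (variable, must_be_true) with 1-based positions.

    Precondition: b_cur >= len(w); smaller bounds are handled by the
    selector literal ¬s_{x,w} described in the text.
    """
    n = len(w)
    units = []
    # Cases 1 and 2: positions 1..|w| not yet encoded at the previous bound.
    if b_prev < n <= b_cur:
        units += [((x, i + 1, w[i]), True) for i in range(n)]
    # Cases 2 and 3: the suffix from position |w|+1 must be empty.
    if b_prev <= n < b_cur:
        units.append(((x, n + 1, LAMBDA), True))
    # Case 4 (|w| < b_prev): nothing new to add.
    return units
```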

Conversely, for an inequality \(\lnot (\textsf{x} \doteq w)\), if \(\textrm{b}_{k}(\textsf{x}) < |w |\), then any substitution is trivially a solution, which we simply encode with \(\top \). Otherwise, we introduce a selector variable \(s_{\textsf{x}, w}'\) and define

$$\llbracket \lnot (\textsf{x} \doteq w)\rrbracket _{\textrm{b}_{k-1}}^{\textrm{b}_{k}} = \Big ( s_{\textsf{x}, w}' \rightarrow \bigvee _{i=1}^{|w |} \lnot h_{\textsf{x}[i]}^{w[i]} \Big ) \wedge \Big ( \lnot s_{\textsf{x}, w}' \rightarrow \bigvee _{i=1}^{|w |} \lnot h_{\textsf{x}[i]}^{w[i]} \vee \lnot h_{\textsf{x}[|w |+1]}^{\lambda } \Big )$$

If \(\textrm{b}_{k}(\textsf{x}) = |w |\), then a substitution h satisfies the constraint if and only if \(h(\textsf{x})[i] \ne w[i]\) for some \(i \in 1, \dots , |w |\). If \(\textrm{b}_{k}(\textsf{x}) > |w |\), in addition, h satisfies the constraint if \(|h(\textsf{x}) | > |w |\). Thus, if \(\textrm{b}_{k}(\textsf{x}) = |w |\), we perform solver call k under the assumption \(s_{\textsf{x}, w}'\), and if \(\textrm{b}_{k}(\textsf{x}) > |w |\), we perform it under the assumption \(\lnot s_{\textsf{x}, w}'\). Again, if \(|w | < \textrm{b}_{k-1}(\textsf{x})\), then \({\llbracket \lnot (\textsf{x} \doteq w)\rrbracket }^{\textrm{b}_{k-1}} \Leftrightarrow {\llbracket \lnot (\textsf{x} \doteq w)\rrbracket }^{\textrm{b}_{k}}\).

Prefix and Suffix Constraints. A prefix constraint \(w \overset{.}{\sqsubseteq }\textsf{x}\) expresses that the first \(|w |\) positions of \(\textsf{x}\) must be mapped exactly onto w. As with equations between a variable \(\textsf{x}\) and a constant word w, we could express this as a regular constraint of the form \(\textsf{x} \overset{.}{\in }w \cdot ?^*\). However, we achieve a more efficient encoding simply by dropping from the encoding of \(\llbracket \textsf{x} \doteq w\rrbracket \) the assertion that the suffix of \(\textsf{x}\) starting at position \(|w |+1\) be empty. Accordingly, a negated prefix constraint expresses that there is an index \(i \in 1, \dots , |w |\) such that the i-th position of \(\textsf{x}\) is mapped onto a symbol different from w[i], which we encode by repurposing the encoding of \(\lnot (\textsf{x} \doteq w)\) in a similar manner. Suffix constraints \(w \overset{.}{\sqsupseteq }\textsf{x}\) and their negations can be encoded by analogous modifications to the encodings of \(\textsf{x} \doteq w\) and \(\lnot (\textsf{x} \doteq w)\).
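
The resulting prefix encoding is thus the constant-equation encoding with the empty-suffix conjunct dropped. A minimal Python sketch (illustrative representation, as before), assuming the trivially unsatisfiable case \(\textrm{b}_{k}(\textsf{x}) < |w |\) is handled via a selector literal:

```python
def encode_prefix(x, w, b_prev, b_cur):
    """Incremental encoding of the prefix constraint w ⊑̇ x (sketch).

    Returns unit assertions (variable, must_be_true), or None when the
    bound is too small (handled via a selector literal, as for x ≐ w).
    Unlike x ≐ w, no empty suffix is forced after position |w|.
    """
    n = len(w)
    if b_cur < n:
        return None  # trivially unsatisfiable at this bound
    if b_prev < n:
        # First bound large enough: encode all |w| positions at once.
        return [((x, i + 1, w[i]), True) for i in range(n)]
    return []  # already fully encoded at an earlier bound
```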

6 Refining Upper Bounds

Our procedure solves a series of SAT problems in which the length bounds on string variables increase after each unsatisfiable solver call. The procedure terminates once the bounds are large enough that further increasing them would be futile. To determine when this is the case, we rely on upper bounds on a shortest solution to a formula \(\psi \). We call a model h of \(\psi \) a shortest solution of \(\psi \) if \(\psi \) has no model \(h'\) such that \(\sum _{\textsf{x} \in \varGamma } |h'(\textsf{x}) | < \sum _{\textsf{x} \in \varGamma } |h(\textsf{x}) |\). We first establish this bound for conjunctive formulas in normal form, where all literals are of the form \(\lnot (\textsf{x} \doteq \textsf{y})\), \(\textsf{x} \overset{.}{\in }R\), or \(\lnot (\textsf{x} \overset{.}{\in }R)\). We then show how the bound can be generalized to arbitrary formulas.

Let \(\varphi \) be a formula in normal form and let \(\textsf{x}_1,\ldots ,\textsf{x}_n\) be the variables of \(\varphi \). For each variable \(\textsf{x}_i\), we can collect all the regular constraints on \(\textsf{x}_i\), that is, all the literals of the form \(\textsf{x}_i\overset{.}{\in }R\) or \(\lnot (\textsf{x}_i\overset{.}{\in }R)\) that occur in \(\varphi \). We can characterize the solutions to all these constraints by a single nondeterministic finite automaton \(M_i\). If the constraints on \(\textsf{x}_i\) are \(\textsf{x}_i\overset{.}{\in }R_1, \ldots , \textsf{x}_i\overset{.}{\in }R_k\) and \(\lnot (\textsf{x}_i\overset{.}{\in }R_1'), \ldots , \lnot (\textsf{x}_i\overset{.}{\in }R_l')\), then \(M_i\) is an NFA that accepts the regular language \(\bigcap _{t=1}^k \mathcal {L}(R_t) \cap \bigcap _{t=1}^l \overline{\mathcal {L}(R_t')}\), where \(\overline{\mathcal {L}(R)}\) denotes the complement of \(\mathcal {L}(R)\). We say that \(M_i\) accepts the regular constraints on \(\textsf{x}_i\) in \(\varphi \). If there are no such constraints on \(\textsf{x}_i\), then \(M_i\) is the one-state NFA that accepts the full language \(\varSigma ^*\). Let \(Q_i\) denote the set of states of \(M_i\). If we do not take inequalities into account and if the regular constraints on \(\textsf{x}_i\) are satisfiable, then a shortest solution h satisfies \(|h(\textsf{x}_i)| \le |Q_i|\).
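
The bound \(|h(\textsf{x}_i)| \le |Q_i|\) rests on the fact that a shortest accepted word has a run that never repeats a state, so its length can be found by breadth-first search over the NFA's state graph. The following Python sketch illustrates this (the dictionary-based NFA representation is our own, purely for illustration):

```python
from collections import deque

def shortest_accepted_length(states, delta, q0, finals):
    """Length of a shortest word accepted by the NFA, or None if the
    language is empty. BFS explores states in order of distance from q0,
    so the result is at most len(states) - 1 transitions."""
    dist = {q0: 0}
    queue = deque([q0])
    while queue:
        q = queue.popleft()
        if q in finals:
            return dist[q]
        for (p, a), succs in delta.items():
            if p == q:
                for r in succs:
                    if r not in dist:
                        dist[r] = dist[q] + 1
                        queue.append(r)
    return None
```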

Theorem 6.1 gives a bound for the general case with variable inequalities. Intuitively, we prove the theorem by constructing a single automaton \(\mathcal {P}\) that takes as input a vector of words \(W = (w_1, ..., w_n)^T\) and accepts W iff the substitution \(h_W\) with \(h_W(\textsf{x}_i) = w_i\) satisfies \(\varphi \). To construct \(\mathcal {P}\), we introduce one two-state NFA for each inequality and we then form the product of these NFAs with (slightly modified versions of) the NFAs \(M_1, \dots , M_n\). We can then derive the bound of a shortest solution from the number of states of \(\mathcal {P}\).

Theorem 6.1

Let \(\varphi \) be a conjunctive formula in normal form over variables \(\textsf{x}_1,\ldots ,\textsf{x}_n\). Let \(M_i=(Q_i,\varSigma ,\delta _i,q_{0,i},F_i)\) be an NFA that accepts the regular constraints on \(\textsf{x}_i\) in \(\varphi \) and let k be the number of inequalities occurring in \(\varphi \). If \(\varphi \) is satisfiable, then it has a model h such that

$$|h(\textsf{x}_i)| \le 2^k \times |Q_1| \times \ldots \times |Q_n|.$$

Proof

Let \(\lambda \) be a symbol that does not belong to \(\varSigma \) and define \(\varSigma _\lambda = \varSigma \cup \{\lambda \}\). As previously, we use \(\lambda \) to extend words of \(\varSigma ^*\) by padding. Given a word \(w \in \varSigma _\lambda ^*\), we denote by \(\hat{w}\) the word of \(\varSigma ^*\) obtained by removing all occurrences of \(\lambda \) from w. We say that w is well-formed if it can be written as \(w = v\cdot \lambda ^t\) with \(v\in \varSigma ^*\) and \(t\ge 0\). In this case, we have \(\hat{w} = v\). Thus a well-formed word w consists of a prefix in \(\varSigma ^*\) followed by a sequence of \(\lambda \)s.

Let \(\varDelta \) be the alphabet \(\varSigma _\lambda ^n\), i.e., the letters of \(\varDelta \) are the n-letter words over \(\varSigma _\lambda \). We can then represent a letter u of \(\varDelta \) as an n-element vector \((u_1, \ldots , u_n)\), and a word W of \(\varDelta ^t\) can be written as an \(n\times t\) matrix

$$W = \begin{pmatrix} u_{11} &{} \ldots &{} u_{t1} \\ \vdots &{} &{} \vdots \\ u_{1n} &{} \ldots &{} u_{tn} \\ \end{pmatrix} $$

where \(u_{ij} \in \varSigma _\lambda \). Each column of this matrix is a letter in \(\varDelta \) and each row is a word in \(\varSigma _\lambda ^t\). We denote by \(p_i(W)\) the i-th row of this matrix and by \(\hat{p_i}(W)=\widehat{p_i(W)}\) the word \(p_i(W)\) with all occurrences of \(\lambda \) removed. We say that W is well-formed if the words \(p_1(W),\ldots ,p_n(W)\) are all well-formed. Given a well-formed word W, we can construct a mapping \(h_W: \{ \textsf{x}_1,\ldots ,\textsf{x}_n\} \rightarrow \varSigma ^*\) by setting \(h_W(\textsf{x}_i) = \hat{p_i}(W)\) and we have \(|h_W(\textsf{x}_i)| \le |W| = t\).
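
The row projections and the well-formedness check are easy to state concretely. A Python sketch, using `#` as a stand-in for the padding symbol \(\lambda \) and representing a word over \(\varDelta \) as a list of columns (this representation is illustrative, not from the paper):

```python
LAM = '#'  # stands in for the padding symbol λ

def rows(W):
    """Rows p_1(W), ..., p_n(W) of a word W over Δ = Σ_λ^n,
    where W is given as a list of n-tuples (columns)."""
    return [''.join(col[i] for col in W) for i in range(len(W[0]))] if W else []

def strip_pad(w):
    """The word ŵ: w with all padding symbols removed."""
    return w.replace(LAM, '')

def well_formed(w):
    """True iff w is a Σ-prefix followed only by padding symbols."""
    return LAM not in w.rstrip(LAM)
```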

To prove the theorem, we build an NFA \(\mathcal{P}\) with alphabet \(\varDelta \) such that a well-formed word W is accepted by \(\mathcal{P}\) iff \(h_W\) satisfies \(\varphi \). The shortest well-formed W accepted by \(\mathcal{P}\) has length no more than the number of states of \(\mathcal{P}\) and the bound will follow.

We first extend the NFA \(M_i=(Q_i, \varSigma ,\delta _i,q_{0,i},F_i)\) to an automaton \(M'_i\) with alphabet \(\varDelta \). \(M'_i\) has the same set of states, initial state, and final states as \(M_i\). Its transition relation \(\delta '_i\) is defined by

$$\begin{aligned} \delta '_i(q, u)= & {} \left\{ \begin{array}{ll} \delta _i(q, u_i) &{} ~{\text {if } u_i\in \varSigma } \\ \{ q \} &{} ~{\text {if } u_i = \lambda } \end{array} \right. \end{aligned}$$

One can easily check that \(M'_i\) accepts a word W iff \(M_i\) accepts \(\hat{p_i}(W)\).

For an inequality \(\lnot (\textsf{x}_i \doteq \textsf{x}_j)\), we construct an NFA \(D_{i,j}=(\{e, d\}, \varDelta , \delta , e, \{ d \})\) with transition function defined as follows:

$$\begin{aligned} \delta (e, u)= & {} \{ e \}~~{\text {if } u_i = u_j}\\ \delta (e, u)= & {} \{ d \}~~{\text {if } u_i \ne u_j} \\ \delta (d, u)= & {} \{ d \}. \end{aligned}$$

This NFA has two states. It starts in state e (for “equal”) and stays in e as long as the characters \(u_i\) and \(u_j\) are equal. It transitions to state d (for “different”) on the first u where \(u_i \ne u_j\) and stays in state d from that point on. Since d is the only final state, a word W is accepted by \(D_{i,j}\) iff \(p_i(W)\ne p_j(W)\). If W is well-formed, we also have that W is accepted by \(D_{i,j}\) iff \(\hat{p_i}(W)\ne \hat{p_j}(W)\).
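
Running \(D_{i,j}\) requires only the two rows it compares. A Python sketch of this two-state automaton (the pair-list input format is illustrative):

```python
def accepts_diff(u_pairs):
    """Run the two-state NFA D_{i,j} on a word over Δ, projected to the
    letter pairs (u_i, u_j). Accepts iff the two rows differ somewhere."""
    state = 'e'  # "equal" so far
    for ui, uj in u_pairs:
        if state == 'e' and ui != uj:
            state = 'd'  # "different" -- absorbing final state
    return state == 'd'
```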

Let \(\lnot (\textsf{x}_{i_1} \doteq \textsf{x}_{j_1}), \ldots , \lnot (\textsf{x}_{i_k} \doteq \textsf{x}_{j_k})\) denote the k inequalities of \(\varphi \). We define \(\mathcal{P}\) to be the product of the NFAs \(M'_1,\ldots ,M'_n\) and \(D_{i_1,j_1},\ldots ,D_{i_k,j_k}\). A well-formed word W is accepted by \(\mathcal{P}\) iff it is accepted by all \(M'_i\) and all \(D_{i_t,j_t}\), which means that \(\mathcal{P}\) accepts a well-formed word W iff \(h_W\) satisfies \(\varphi \).

Let P be the set of states of \(\mathcal{P}\). We then have \(|P| \le 2^k \times |Q_1|\times \ldots \times |Q_n|\). Assume \(\varphi \) is satisfiable, so \(\mathcal{P}\) accepts a well-formed word W. The shortest well-formed word accepted by \(\mathcal{P}\) has an accepting run that does not visit the same state twice. So the length of this well-formed word W is no more than |P|. The mapping \(h_W\) satisfies \(\varphi \) and for every \(\textsf{x}_i\), it satisfies \(|h_W(\textsf{x}_i)| = |\hat{p_i}(W)| \le |W| \le |P| \le 2^k \times |Q_1|\times \ldots \times |Q_n|\).    \(\square \)

The bound given by Theorem 6.1 holds if \(\varphi \) is in normal form, but it also holds for a general conjunctive formula \(\psi \). This follows from the observation that converting conjunctive formulas to normal form preserves the length of solutions. In particular, we convert \(\psi \wedge \textsf{x}\doteq \textsf{y}\) to the formula \(\psi ' = \psi [\textsf{x}:=\textsf{y}]\) so that \(\textsf{x}\) does not occur in \(\psi '\), but clearly, a bound for \(\textsf{y}\) in \(\psi '\) gives us the same bound for \(\textsf{x}\) in \(\psi \).

In practice, before we apply the theorem, we decompose the conjunctive formula \(\varphi \) into subformulas that have disjoint sets of variables. We write \(\varphi \) as \(\varphi _1 \wedge \ldots \wedge \varphi _m\) where the conjuncts have no common variables. Then, \(\varphi \) is satisfiable iff each conjunct \(\varphi _t\) is satisfiable, and we derive upper bounds on the shortest solution for the variables of each \(\varphi _t\), which gives more precise bounds than deriving bounds from \(\varphi \) directly. In particular, if a variable \(\textsf{x}_i\) of \(\varphi \) does not occur in any inequality, then the bound on \(|h(\textsf{x}_i)|\) is \(|Q_i|\).
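
Splitting \(\varphi \) into variable-disjoint conjuncts is a connected-components computation over the incidence between constraints and variables. A Python sketch using union-find (representing each constraint by its set of variables is our own simplification):

```python
def components(constraint_vars):
    """Group constraints whose variable sets overlap, transitively.

    constraint_vars is a list of non-empty variable sets, one per constraint.
    Returns a sorted list of lists of constraint indices, one per component.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Union all variables occurring in the same constraint.
    for vars_ in constraint_vars:
        vs = list(vars_)
        for v in vs[1:]:
            ra, rb = find(vs[0]), find(v)
            parent[ra] = rb

    # Group constraint indices by the root of any of their variables.
    groups = {}
    for idx, vars_ in enumerate(constraint_vars):
        root = find(next(iter(vars_)))
        groups.setdefault(root, []).append(idx)
    return sorted(sorted(g) for g in groups.values())
```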

Theorem 6.1 only holds for conjunctive formulas. For an arbitrary (non-conjunctive) formula \(\psi \), a generalization is to convert \(\psi \) into disjunctive normal form. Alternatively, it is sufficient to enumerate the subsets of \({{\,\mathrm{\textit{lits}}\,}}(\psi )\). Given a subset A of \({{\,\mathrm{\textit{lits}}\,}}(\psi )\), let us denote by \(d_A\) a mapping that bounds the length of solutions to A, i.e., any solution h to A satisfies \(|h(\textsf{x})| \le d_A(\textsf{x})\) for all \(\textsf{x} \in \varGamma \). This mapping \(d_A\) can be computed from Theorem 6.1. The following property gives a bound for \(\psi \).

Proposition 6.2

If \(\psi \) is satisfiable, then it has a model h such that for all \(\textsf{x}\in \varGamma \), it holds that \(|h(\textsf{x}) | \le \max \{d_A(\textsf{x}) \mid A \subseteq {{\,\mathrm{\textit{lits}}\,}}(\psi )\}\).

Proof

We can assume that \(\psi \) is in negation normal form. We can then convert \(\psi \) to disjunctive normal form \(\psi \Leftrightarrow \psi _1 \vee \dots \vee \psi _n\) and we have \({{\,\mathrm{\textit{lits}}\,}}(\psi _i) \subseteq {{\,\mathrm{\textit{lits}}\,}}(\psi )\). Also, \(\psi \) is satisfiable if and only if at least one \(\psi _i\) is satisfiable; applying Theorem 6.1 to a satisfiable disjunct \(\psi _i\), whose literals form a subset of \({{\,\mathrm{\textit{lits}}\,}}(\psi )\), yields the bound.    \(\square \)

Since there are \(2^{|{{\,\mathrm{\textit{lits}}\,}}(\psi ) |}\) subsets of \({{\,\mathrm{\textit{lits}}\,}}(\psi )\), a direct application of Proposition 6.2 is rarely feasible in practice. Fortunately, we can use unsatisfiable cores to reduce the number of subsets to consider.

6.1 Unsatisfiable-Core Analysis

Instead of calculating the bounds upfront, we use the unsatisfiable core produced by the SAT solver after each incremental call to evaluate whether the upper bounds on the variables exceed the upper bounds of the shortest solution. If \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\) is unsatisfiable for bounds \(\textrm{b}\), then it has an unsatisfiable core

$$\begin{aligned} C = C_\mathcal {A} \wedge C_{h} \wedge \bigwedge _{a\in {{\,\mathrm{\textit{atoms}}\,}}^+(\psi )} C_a \wedge \bigwedge _{a\in {{\,\mathrm{\textit{atoms}}\,}}^-(\psi )} C_{\bar{a}} \end{aligned}$$

with (possibly empty) subsets of clauses \(C_\mathcal {A} \subseteq \psi _\mathcal {A}\), \(C_h \subseteq {\llbracket h\rrbracket }^{\textrm{b}_{}}\), \(C_a \subseteq ({{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow {\llbracket a\rrbracket }^{\textrm{b}_{}})\), and \(C_{\bar{a}} \subseteq (\lnot {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow {\llbracket \lnot a\rrbracket }^{\textrm{b}_{}})\). Here we implicitly assume \(\psi _\mathcal {A}\), \({{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow {\llbracket a\rrbracket }^{\textrm{b}_{}}\), and \(\lnot {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow {\llbracket \lnot a\rrbracket }^{\textrm{b}_{}}\) to be in CNF. Let \(\mathcal {C}^+ = \{a \mid C_a \ne \emptyset \}\) and \(\mathcal {C}^- = \{\lnot a \mid C_{\bar{a}} \ne \emptyset \}\) be the sets of literals whose encodings contain at least one clause of the core C. Using these sets, we construct the formula

$$\begin{aligned} \psi ^\mathcal {C} = \psi _\mathcal {A} \wedge \bigwedge _{a \in \mathcal {C}^+} {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow a \wedge \bigwedge _{\lnot a \in \mathcal {C}^-} \lnot {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow \lnot a\text {,} \end{aligned}$$

which consists of the conjunction of the abstraction and the definitions of the literals that are contained in \(\mathcal {C}^+\), respectively \(\mathcal {C}^-\). Recall that \(\psi \) is equisatisfiable to the conjunction \(\psi _\mathcal {A} \wedge \bigwedge _{d\in \textbf{D}} d\) of the abstraction and all definitions in \(\textbf{D}\). Let \(\psi '\) denote this formula, i.e.,

$$\begin{aligned} \psi '= & {} \psi _\mathcal {A} \wedge \bigwedge _{a\in {{\,\mathrm{\textit{atoms}}\,}}^+(\psi )} {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow a\wedge \bigwedge _{\lnot a \in {{\,\mathrm{\textit{atoms}}\,}}^-(\psi )} \lnot {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow \lnot a\text {.} \end{aligned}$$

The following proposition shows that it suffices to refine the bounds according to \(\psi ^\mathcal {C}\).

Proposition 6.3

Let \(\psi \) be unsatisfiable with respect to \(\textrm{b}\) and let C be an unsatisfiable core of \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\). Then, \(\psi ^\mathcal {C}\) is unsatisfiable with respect to \(\textrm{b}\) and \(\psi ' \models \psi ^\mathcal {C}\).

Proof

By definition, we have \({\llbracket \psi ^\mathcal {C}\rrbracket }^{\textrm{b}_{}} = \psi _\mathcal {A} \wedge {\llbracket h\rrbracket }^{\textrm{b}_{}} \wedge \bigwedge _{a\in \mathcal {C}^+} {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow {\llbracket a\rrbracket }^{\textrm{b}_{}} \wedge \bigwedge _{\lnot a \in \mathcal {C}^-} \lnot {{\,\mathrm{\textbf{d}}\,}}(a) \rightarrow {\llbracket \lnot a\rrbracket }^{\textrm{b}_{}}\). This implies \(C \subseteq {\llbracket \psi ^\mathcal {C}\rrbracket }^{\textrm{b}_{}}\) and, since C is an unsatisfiable core, \({\llbracket \psi ^\mathcal {C}\rrbracket }^{\textrm{b}_{}}\) is unsatisfiable. That is, \(\psi ^\mathcal {C}\) is unsatisfiable with respect to \(\textrm{b}\). We also have \(\psi ' \models \psi ^\mathcal {C}\) since \(\mathcal {C}^+ \subseteq {{\,\mathrm{\textit{atoms}}\,}}^+(\psi )\) and \(\mathcal {C}^- \subseteq {{\,\mathrm{\textit{atoms}}\,}}^-(\psi )\).    \(\square \)

Applying Proposition 6.2 to \(\psi ^\mathcal {C}\) yields upper bounds on a shortest solution \(h_C\) of \(\psi ^\mathcal {C}\). If \(|h_C(\textsf{x}) | \le \textrm{b}_{}(\textsf{x})\) holds for all \(\textsf{x} \in \varGamma \), then \(\psi ^\mathcal {C}\) has no solution, and unsatisfiability of \(\psi '\) follows from Proposition 6.3. Because \(\psi \) and \(\psi '\) are equisatisfiable, we can conclude that \(\psi \) is unsatisfiable.

Otherwise, we increase the bounds on the variables that occur in \(\psi ^\mathcal {C}\) while keeping the bounds on the other variables unchanged: we construct \(\textrm{b}_{k+1}\) with \(\textrm{b}_{k}(\textsf{x}) \le \textrm{b}_{k+1}(\textsf{x}) \le |h_C(\textsf{x}) |\) for all \(\textsf{x} \in \varGamma \), such that \(\textrm{b}_{k}(\textsf{y}) < \textrm{b}_{k+1}(\textsf{y})\) holds for at least one \(\textsf{y} \in V(\psi ^\mathcal {C})\). By strictly increasing at least one variable’s bound, we eventually either reach the upper bounds of \(\psi ^\mathcal {C}\) and return unsatisfiability, or we eliminate \(\psi ^\mathcal {C}\) as an unsatisfiable implication of \(\psi \). As there are only finitely many possibilities for \(\mathcal {C}\) and thus for \(\psi ^\mathcal {C}\), our procedure is guaranteed to terminate.

We do not explicitly construct formula \(\psi ^\mathcal {C}\) to compute bounds on \(h_C\) as we know the set \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C}) = \mathcal {C}^+ \cup \; \mathcal {C}^-\). Finding upper bounds still requires enumerating all subsets of \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C})\), but we have \(|{{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C}) | \le |{{\,\mathrm{\textit{lits}}\,}}(\psi ) |\) and usually \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C})\) is much smaller than \({{\,\mathrm{\textit{lits}}\,}}(\psi )\). For example, consider the formula

$$\psi = (\textsf{x} \doteq a \vee \textsf{x} \overset{.}{\in }ab^*) \wedge \textsf{x} \doteq \textsf{y} \wedge \textsf{y} \overset{.}{\in }ab\cdot ?^* \wedge \ldots$$

which is unsatisfiable for the bounds \(\textrm{b}(\textsf{x}) = \textrm{b}(\textsf{y}) = 1\) and \(\textrm{b}(\textsf{z}) = 4\). The unsatisfiable core C returned after solving \({\llbracket \psi \rrbracket }^{\textrm{b}_{}}\) results in the formula \({\psi ^\mathcal {C} = (\textsf{x} \doteq a \vee \textsf{x} \overset{.}{\in }ab^*) \wedge \textsf{x} \doteq \textsf{y} \wedge \textsf{y} \overset{.}{\in }ab\cdot ?^*}\) containing four literals. Finding upper bounds for \(\psi ^\mathcal {C}\) thus amounts to enumerating just \(2^4\) subsets, which is substantially less than considering all \(2^7\) subsets of \({{\,\mathrm{\textit{lits}}\,}}(\psi )\) upfront. The conjunction of a subset of \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C})\) yielding the largest upper bounds is \(\textsf{x} \overset{.}{\in }ab^* \wedge \textsf{x} \doteq \textsf{y} \wedge \textsf{y} \overset{.}{\in }ab\cdot ?^*\), which simplifies to \(\textsf{x} \overset{.}{\in }ab^* \cap ab\cdot ?^*\) and has a solution of length at most 2 for \(\textsf{x}\) and \(\textsf{y}\). With bounds \(\textrm{b}(\textsf{x}) = \textrm{b}(\textsf{y}) = 2\) and \(\textrm{b}(\textsf{z}) = 4\), the formula is satisfiable.

7 Implementation

We have implemented our approach in a solver called nfa2sat. nfa2sat is written in Rust and uses CaDiCaL  [9] as the backend SAT solver. We use the incremental API provided by CaDiCaL to solve problems under assumptions. Soundness of nfa2sat follows from Theorem 5.1. For completeness, we rely on CaDiCaL’s failed function to efficiently determine failed assumptions, i.e., assumption literals that were used to conclude unsatisfiability.

The procedure works as follows. Given a formula \(\psi \), we first introduce one fresh Boolean selector variable \(s_l\) for each theory literal \(l \in {{\,\mathrm{\textit{lits}}\,}}(\psi )\). Then, instead of adding the encoded definitions of the theory literals directly to the SAT solver, we guard them with their corresponding selector variables: for a positive literal a, we add \(s_a \rightarrow ({{\,\mathrm{\textbf{d}}\,}}(a)\rightarrow \llbracket a\rrbracket )\), and for a negative literal \(\lnot a\), we add \(s_{\lnot a}\rightarrow (\lnot {{\,\mathrm{\textbf{d}}\,}}(a)\rightarrow \llbracket \lnot a\rrbracket )\) (treating assumptions introduced by \(\llbracket a\rrbracket \) as unit clauses). In the resulting CNF formula, the new selector variables are present in all clauses that encode their corresponding definition, and we use them as assumptions for every incremental call to the SAT solver, which does not affect satisfiability. If such an assumption fails, then we know that at least one of the corresponding clauses in the propositional formula was part of an unsatisfiable core, which enables us to efficiently construct the sets \(\mathcal {C}^+\) and \(\mathcal {C}^-\) of positive and negative atoms present in the unsatisfiable core. As noted previously, we have \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C}) = \mathcal {C}^+ \cup \mathcal {C}^-\) and hence these sets suffice to find bounds on a shortest model for \(\psi ^\mathcal {C}\).
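
Mapping failed assumptions back to theory literals is then a simple lookup. A Python sketch of this bookkeeping (the data structures are illustrative; the actual interaction with CaDiCaL's failed function is not shown):

```python
def core_literals(failed_selectors, selector_to_literal):
    """Recover the sets C+ and C- from failed assumption selectors.

    selector_to_literal maps each selector variable to a signed theory
    atom (atom, positive). Returns (C+, C-) as sets of atoms whose
    encodings contributed clauses to the unsatisfiable core.
    """
    c_pos, c_neg = set(), set()
    for s in failed_selectors:
        atom, positive = selector_to_literal[s]
        (c_pos if positive else c_neg).add(atom)
    return c_pos, c_neg
```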

This approach is efficient for obtaining \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C})\), but since CaDiCaL does not guarantee that the set of failed assumptions is minimal, \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C})\) is not minimal in general. Moreover, even a minimal \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C})\) can contain too many elements for processing all subsets. To address this issue, we enumerate the subsets only if \({{\,\mathrm{\textit{lits}}\,}}(\psi ^\mathcal {C})\) is small (by default, we use a limit of ten literals). In this case, we construct the automata \(M_i\) used in Theorem 6.1 for each subset, applying the techniques described in [7] to quickly rule out unsatisfiable ones. Otherwise, instead of enumerating the subsets, we resort to sound approximations of the upper bounds, which amounts to over-approximating the number of states without explicitly constructing the automata (cf. [14]).

Once we have obtained upper bounds on the length of a solution of \(\psi ^\mathcal {C}\), we increase the bounds on all variables involved, except those that have reached their maximum. Our default heuristic computes a new bound that is either double the current bound of a variable or its maximum, whichever is smaller.
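
The bound-update heuristic can be stated in a few lines. A Python sketch (illustrative; it assumes bounds are at least 1, so that doubling always makes progress):

```python
def next_bound(current, maximum):
    """Default heuristic: double the current bound, capped at the maximum."""
    return min(2 * current, maximum)

def refine(bounds, maxima, core_vars):
    """Increase bounds only for variables in the unsat core that have not
    yet reached their maximum; other bounds stay unchanged."""
    return {v: next_bound(b, maxima[v]) if v in core_vars and b < maxima[v] else b
            for v, b in bounds.items()}
```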

8 Experimental Evaluation

We have evaluated our solver on a large set of benchmarks from the ZaligVinder [22] repository. The repository contains 120,287 benchmarks stemming from both academic and industrial applications. In particular, all the string problems from the SMT-LIB repository are included in the ZaligVinder repository. We converted the ZaligVinder problems to the SMT-LIB 2.6 syntax and removed duplicates. This resulted in 82,632 unique problems, out of which 29,599 are in the logical fragment we support.

We compare nfa2sat with the state-of-the-art solvers cvc5 (version 1.0.3) and Z3 (version 4.12.0). The comparison is limited to these two solvers because they are widely adopted and because they had the best performance in our evaluation. Other string solvers either do not support our logical fragment (CertiStr, Woorpje) or gave incorrect answers on the benchmark problems considered here. Older, no-longer-maintained solvers have known soundness problems, as reported in [7] and [27].

We ran our experiment on a Linux server, with a timeout of 1200 seconds of CPU time and a memory limit of 16 GB. Table 1 shows the results. As a single tool, nfa2sat solves more problems than cvc5 but not as many as Z3. All three tools solve more than 98% of the problems.

The table also shows results of portfolios that combine two solvers. In a portfolio configuration, the best setting is to use both Z3 and nfa2sat. This combination solves all but 20 problems within the timeout. It also reduces the total run-time from 283,942 s for Z3 (about 79 h) to 28,914 s for the portfolio (about 8 h), that is, a 90% reduction in total solve time. The other two portfolios—namely, Z3 with cvc5 and nfa2sat with cvc5—also have better performance than a single solver, but the improvement in runtime and number of timeouts is not as large.

Table 1. Evaluation on ZaligVinder benchmarks. The three left columns show results of individual solvers. The other three columns show results of portfolios combining two solvers.
Fig. 4.

Comparison of runtime (in seconds) with Z3 and cvc5. The left plots include all problems, the middle plots include only satisfiable problems, and the right plots include only unsatisfiable problems. The lines marked “failed” correspond to problems that are not solved because a solver ran out of memory. The lines marked “timeout” correspond to problems not solved because of a timeout (1200 s).

Figure 4a illustrates why nfa2sat and Z3 complement each other well. The figure shows three scatter plots that compare the runtime of nfa2sat and Z3 on our problems. The plot on the left compares the two solvers on all problems, the one in the middle compares them on satisfiable problems, and the one on the right compares them on unsatisfiable problems. Points in the left plot are concentrated close to the axes, with a smaller number of points near the diagonal, meaning that Z3 and nfa2sat have different runtimes on most problems. The other two plots show this even more clearly: nfa2sat is faster on satisfiable problems while Z3 is faster on unsatisfiable problems. Figure 4b shows analogous scatter plots comparing nfa2sat and cvc5. The two solvers show similar performance on a large set of easy benchmarks, although cvc5 is faster on problems that both solvers can solve in less than 1 s. However, cvc5 times out on 38 problems that nfa2sat solves in less than 2 s. On unsatisfiable problems, cvc5 tends to be faster than nfa2sat, but there is a class of problems for which nfa2sat takes between 10 and 100 s whereas cvc5 is slower.

Overall, the comparison shows that nfa2sat is competitive with cvc5 and Z3 on these benchmarks. We also observe that nfa2sat tends to work better on satisfiable problems. For best overall performance, our experiments show that a portfolio of Z3 and nfa2sat would solve all but 20 problems within the timeout, and reduce the total solve time by 90%.

9 Conclusion

We have presented the first eager SAT-based approach to string solving that is both sound and complete for a reasonably expressive fragment of string theory. Our experimental evaluation shows that our approach is competitive with the state-of-the-art lazy SMT solvers Z3 and cvc5, outperforming them on satisfiable problems but falling behind on unsatisfiable ones. A portfolio that combines our approach with these solvers—particularly with Z3—would thus yield strong performance across both types of problems.

In future work, we plan to extend our approach to a more expressive logical fragment, including more general word equations. Other avenues of research include the adaptation of model-checking techniques such as IC3 [10] to string problems, which we hope would lead to better performance on unsatisfiable instances. A particular benefit of the eager approach is that it enables the use of mature techniques from the SAT world, especially for proof generation and parallel solving. Producing proofs of unsatisfiability is complex for traditional CDCL(T) solvers because of the complex rewriting and deduction rules they employ. In contrast, efficiently generating and checking proofs produced by SAT solvers (using the DRAT format [32]) is well-established and practicable. A challenge in this respect would be to combine unsatisfiability proofs from a SAT solver with a proof that our reduction to SAT is sound. For parallel solving, we plan to explore the use of a parallel incremental solver (such as iLingeling [9]) as well as other possible ways to solve multiple bounds in parallel.