
1 Introduction

Sequences are an extension of strings, wherein elements might range over an infinite domain (e.g., integers, strings, and even sequences themselves). Sequences are ubiquitous and commonly used data types in modern programming languages. They come under different names, e.g., Python/Haskell/Prolog lists, Java ArrayList (and to some extent Streams) and JavaScript arrays. Crucially, sequences are extendable, and a plethora of operations (including append, map, split, filter, concatenation, etc.) can naturally be defined and are supported by built-in library functions in most modern programming languages.

Various techniques in software model checking [30], including symbolic execution and invariant generation, require an appropriate SMT theory, to which verification conditions can be discharged. In the case of programs operating on sequences, we would consequently require an SMT theory of sequences, for which leading SMT solvers like Z3 [6, 38] and cvc5 [4] have provided some basic support for over a decade. The basic design of sequence theories, as done in Z3 and cvc5, as well as in other formalisms like symbolic automata [15], is in fact quite natural. That is, sequence theories can be thought of as extensions of theories of strings with an infinite alphabet of letters, together with a corresponding alphabet theory, e.g. Linear Integer Arithmetic (LIA) for reasoning about sequences of integers. Despite this, very little is known about what is decidable over theories of sequences.

In the case of finite alphabets, sequence theories become theories over strings, in which a lot of progress has been made in the last few decades, barring the long-standing open problem of string equations with length constraints (e.g. see [26]). For example, it is known that the existential theory of concatenation over strings with regular constraints is decidable (in fact, PSpace-complete), e.g., see [17, 29, 36, 40, 43]. Here, a regular constraint takes the form \(x \in L(E)\), where E is a regular expression, mandating that the expression E matches the string represented by x. In addition, several natural syntactic restrictions, including straight-line, acyclicity, and chain-free (e.g. [1, 2, 5, 11, 12, 26, 35]), have been identified, with which string constraints remain decidable in the presence of more complex string functions (e.g. transducers, replace-all, reverse, etc.). In the case of infinite alphabets, only a handful of results are available. Furia [25] showed that the existential theory of sequence equations over the alphabet theory of LIA is decidable by a reduction to the existential theory of concatenation over strings (over a finite alphabet) without regular constraints. Loosely speaking, a number (e.g. 4) can be represented as a string in unary (e.g. 1111), and addition is then simulated by concatenation. Therefore, his decidability result does not extend to other data domains and alphabet theories. Wang et al. [45] define an extension of the array property fragment [9] with concatenation. This fragment imposes strong restrictions, however, on the equations between sequences (here called finite arrays) that can be considered.

“Regular Constraints” Over Sequences. One answer to what a regular constraint is over sequences is provided by automata modulo theories. Automata modulo theories [15, 16] are an elegant framework that can be used to capture the notion of regular constraints over sequences: Fix an alphabet theory \(T\) that forms a Boolean algebra; this is satisfied by virtually all existing SMT theories. In this framework, one uses formulas in \(T\) to capture multiple (possibly infinitely many) transitions of an automaton. More precisely, between two states in a symbolic automaton one associates a unary formula \(\varphi (x) \in T\). For example, \(q \rightarrow _{\varphi } q'\) with \(\varphi := x \equiv 0 \pmod {2}\) over LIA corresponds to all transitions \(q \rightarrow _i q'\) with any even number i. Despite their nice properties, it is known that many simple languages cannot be captured using symbolic automata; e.g., one cannot express the language consisting of sequences containing the same even number i throughout the sequence.

There are essentially two (expressively incomparable) extensions of symbolic automata that address the aforementioned problem: (i) Symbolic Register Automata (SRA) [14] and (ii) Parametric Automata (PA) [21, 23, 24]. The model SRA was obtained by combining register automata [31] and symbolic automata. The model PA extends symbolic automata by allowing free variables (a.k.a. parameters) in the transition guards, i.e., the guard will be of the form \(\varphi (x,\bar{p})\), for parameters \(\bar{p}\). In an accepting path of PA, a parameter p used in multiple transitions has to be instantiated with the same value, which enables comparisons of different positions in an input sequence. For example, we can assert that only sequences of the form \(i^*\), for an even number i, are accepted by the PA with a single transition \(q \rightarrow _\varphi q\) with \(\varphi (x,p) := x = p \wedge x \equiv 0 \pmod {2}\) and q being the start and final state. PA can also be construed as an extension of both variable automata [27] and symbolic automata. SRA and PA are not comparable: while parameters can be construed as read-only registers, SRA can only compare two different positions using equality, while PA may use a general formula in the theory in such a comparison (e.g., order).

Contributions. The main contribution of this paper is to provide the first decidable fragments of a theory of sequences parameterized in the element theory. In particular, we show how to leverage string solvers to solve theories over sequences. We believe this is especially interesting, in view of the plethora of existing string solvers developed in the last 10 years (e.g. see the survey [3]). This opens up new possibilities for verification tasks to be automated; in particular, we show how verification conditions for Quicksort, as well as Bakery and Dijkstra protocols, can be captured in our sequence theory. This formalization was done in the style of regular model checking [8, 34], whose extension to infinite alphabets has been a longstanding challenge in the field. We also provide a new (dedicated) sequence solver SeCo. We detail our results below.

We first show that the quantifier-free theory of sequences with concatenation and PA as regular constraints is decidable. Assuming that the theory is solvable in PSpace (which is reasonable for most SMT theories), we show that our algorithm runs in ExpSpace (i.e., double-exponential time and exponential space). We also identify conditions on the SMT theory \(T\) under which PSpace can be achieved and as an example show that Linear Real Arithmetic (LRA) satisfies those conditions. This matches the PSpace-completeness of the theory of strings with concatenation and regular constraints [18].

We consider three different variants/extensions:

(i) Add length constraints. Length constraints (e.g., \(|{\textbf {x}}| = |{\textbf {y}}|\) for two sequence variables \({\textbf {x}},{\textbf {y}}\)) are often considered in the context of string theories, but the decidability of the resulting theory (i.e., strings with concatenation and length constraints) is still a long-standing open problem [26]. We show that the case for sequences is Turing-equivalent to the string case.

(ii) Use SRA instead of PA. We show that the resulting theory of sequences is undecidable, even over the alphabet theory \(T\) of equality.

(iii) Add symbolic transducers. Symbolic transducers [15, 16] extend finite-state input/output transducers in the same way that symbolic automata extend finite-state automata. To obtain decidability, we consider formulas satisfying the straight-line restriction that was defined over string theories [35]. We show that the resulting theory is decidable in 2-ExpTime and is ExpSpace-hard, if \(T\) is solvable in PSpace.

We have implemented the solver SeCo based on our algorithms, and demonstrated its efficacy on two classes of benchmarks: (i) invariant checking on array-manipulating programs and parameterized systems, and (ii) benchmarks on Symbolic Register Automata (SRA) from [14]. For (i), we model invariants for QuickSort, Dijkstra’s Self-Stabilizing Protocol [20] and Lamport’s Bakery Algorithm [33] as sequence constraints. For (ii), we solve decision problems for SRA on benchmarks of [14] such as emptiness, equivalence and inclusion on regular expressions with back-references. We report promising experimental results: our solver SeCo is up to three orders of magnitude faster than the SRA solver in [14].

Organization. We provide a motivating example of sequence theories in Sect. 2. Section 3 contains the syntax and semantics of the sequence constraint language, as well as some basic algorithmic results. We deal with equational and regular constraints in Sect. 4. In Sect. 5, we deal with the decidable fragments with equational constraints, regular constraints, and transducers. We deal with extensions of these languages with length and SRA constraints in Sect. 6. In Sect. 7 we report our implementation and experimental results. We conclude in Sect. 8. Missing details and proofs can be found in the full version.

2 Motivating Example

[Listing 1: implementation of QuickSort using the Java Streams API; listing not reproduced here.]

We illustrate the use of sequence theories in verification using an implementation of QuickSort [28], shown in Listing 1. The example uses the Java Streams API and resembles typical implementations of QuickSort in functional languages; the program uses high-level operations on streams and lists like filter and concatenation. As we show, the data types and operations can naturally be modelled using a theory of sequences over integer arithmetic, and our results imply decidability of checks that would be done by a verification system.

The function processes a given list \(\textbf{l}\) by picking the first element \(\textbf{l}_0\) as the pivot, then creating two sub-lists \(\textbf{left}\), \(\textbf{right}\), in which all numbers \(\ge \textbf{l}_0\) (resp., \(< \textbf{l}_0\)) have been eliminated. The function is then recursively invoked on the two sub-lists, and the results are finally concatenated and returned.
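For readers without access to Listing 1, the following is a minimal Python analogue of the function described above. It is a sketch only: the listing itself uses the Java Streams API, and the names `quick_sort` and `same_elements` are chosen here for illustration. The two filters follow the verification condition given later in this section, where the right sub-list is computed after skipping the pivot.

```python
def quick_sort(l):
    """Sort a list of integers by filtering around the first element."""
    if not l:
        return []
    pivot = l[0]
    # left: drop every element >= pivot (the pivot itself is dropped too)
    left = [x for x in l if x < pivot]
    # right: skip the pivot, then drop every element < pivot
    right = [x for x in l[1:] if x >= pivot]
    # recursive calls on both sub-lists, results concatenated around the pivot
    return quick_sort(left) + [pivot] + quick_sort(right)


def same_elements(l, res):
    """Post-condition: sorting does not change the set of elements."""
    return set(l) == set(res)


example = [3, 1, 2, 3, 0]
result = quick_sort(example)
```

On the input `[3, 1, 2, 3, 0]` the sketch returns `[0, 1, 2, 3, 3]`, and `same_elements(example, result)` holds.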

We focus on the verification of the post-condition shown in the beginning of Listing 1: sorting does not change the set of elements contained in the input list. This is a weaker form of the permutation property of sorting algorithms, and as such known to be challenging for verification methods (e.g., [42]). Sortedness of the result list can be stated and verified in a similar way, but is not considered here. Following the classical design-by-contract approach [37], to verify the partial correctness of the function it is enough to show that the post-condition is established in any top-level call of the function, assuming that the post-condition holds for all recursive calls. For the case of non-empty lists, the verification condition, expressed in our logic, is:

$$\begin{aligned}&\left( \begin{array}{@{}l@{}} \textbf{left} = T_{<\textbf{l}_0}(\textbf{l}) \wedge \textbf{right} = T_{\ge \textbf{l}_0}( skip _1(\textbf{l})) \wedge \\ \forall i.\, (i \in \textbf{left} \leftrightarrow i \in \textbf{left}') \wedge \forall i.\, (i \in \textbf{right} \leftrightarrow i \in \textbf{right}') \wedge \\ \textbf{res} = \textbf{left}' \,.\, [\textbf{l}_0] \,.\, \textbf{right}' \end{array} \right) \\&\qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \rightarrow \forall i.\, (i \in \textbf{l} \leftrightarrow i \in \textbf{res}) \end{aligned}$$

The variables \(\textbf{l}, \textbf{res}, \textbf{left}, \textbf{right}, \textbf{left}', \textbf{right}'\) range over sequences of integers, while i is a bound integer variable. The formula uses several operators that a useful sequence theory has to provide: (i) \(\textbf{l}_0\): the first element of input list \(\textbf{l}\); (ii) \(\in \) and \(\not \in \): membership and non-membership of an integer in a list, which can be expressed using symbolic parametric automata; (iii) \( skip _1\), \(T_{<\textbf{l}_0}\), \(T_{\ge \textbf{l}_0}\): sequence-to-sequence functions, which can be represented using symbolic parametric transducers; (iv) \(\cdot \,.\, \cdot \): concatenation of several sequences. The formula otherwise is a direct model of the method in Listing 1; the variables \(\textbf{left}', \textbf{right}'\) are the results of the recursive calls, and concatenated to obtain the result sequence.

In addition, the formula contains quantifiers. To demonstrate validity of the formula, it is enough to eliminate the last quantifier \(\forall i\) by instantiating with a Skolem symbol k, and then instantiate the other quantifiers (left of the implication) with the same k:

$$\begin{aligned} \left( \begin{array}{@{}l@{}} \textbf{left} = T_{<\textbf{l}_0}(\textbf{l}) \wedge \textbf{right} = T_{\ge \textbf{l}_0}( skip _1(\textbf{l})) \wedge \\ (k \in \textbf{left} \leftrightarrow k \in \textbf{left}') \wedge (k \in \textbf{right} \leftrightarrow k \in \textbf{right}') \wedge \\ \textbf{res} = \textbf{left}' \,.\, [\textbf{l}_0] \,.\, \textbf{right}' \end{array} \right) \rightarrow (k \in \textbf{l} \leftrightarrow k \in \textbf{res}) \end{aligned}$$

As one of the results of this paper, we prove that this final formula is in a decidable logic. The formula can be rewritten to a disjunction of straight-line formulas, and shown to be valid using the decision procedure presented in Sect. 5.
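As a sanity check of the instantiated formula, one can enumerate all small sequences over a tiny integer domain and confirm that the implication holds on each of them. The sketch below does this in Python; it is bounded testing only, not the decision procedure of Sect. 5, and the names `seqs`, `vc_holds` and `check_bounded`, as well as the bounds, are illustrative assumptions.

```python
from itertools import product


def seqs(domain, max_len, min_len=0):
    """All tuples over domain with length between min_len and max_len."""
    for n in range(min_len, max_len + 1):
        yield from product(domain, repeat=n)


def vc_holds(l, left_p, right_p, k):
    """Evaluate the quantifier-free verification condition for one instance."""
    pivot = l[0]
    left = tuple(x for x in l if x < pivot)        # T_{< l_0}(l)
    right = tuple(x for x in l[1:] if x >= pivot)  # T_{>= l_0}(skip_1(l))
    res = left_p + (pivot,) + right_p              # left' . [l_0] . right'
    premise = ((k in left) == (k in left_p)) and ((k in right) == (k in right_p))
    return (not premise) or ((k in l) == (k in res))


def check_bounded(domain=(0, 1, 2), max_len=3):
    """Check the condition for all non-empty l and all left', right' of length <= 2."""
    return all(
        vc_holds(l, lp, rp, k)
        for l in seqs(domain, max_len, 1)
        for lp in seqs(domain, 2)
        for rp in seqs(domain, 2)
        for k in domain
    )
```

Here `left_p` and `right_p` play the role of \(\textbf{left}'\) and \(\textbf{right}'\); they are arbitrary sequences, constrained only through the premise of the implication.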

3 Models

In this section, we will define our sequence constraint language, and prove some basic results regarding various constraints in the language. The definition is a natural generalization of string constraints (e.g. see [12, 17, 26, 29, 35]) by employing an alphabet theory (a.k.a. element theory), as is done in symbolic automata and automata modulo theories [15, 16, 44].

For simplicity, our definitions will follow a model-theoretic approach. Let \(\sigma \) be a vocabulary. We fix a \(\sigma \)-structure \(\mathfrak {S}= (D; I)\), where D can be a finite or an infinite set (i.e., the universe) and I maps each function/relation symbol in \(\sigma \) to a function/relation over D. The elements of our sequences will range over D. We assume that the quantifier-free theory \(T_{\mathfrak {S}}\) over \(\mathfrak {S}\) (including equality) is decidable. Examples of such \(T_{\mathfrak {S}}\) abound in SMT, e.g., LRA and LIA. We write T instead of \(T_{\mathfrak {S}}\), when \(\mathfrak {S}\) is clear. Our quantifier-free formulas will use uninterpreted T-constants \(a,b,c,\ldots \), and may also use variables \(x,y,z,\ldots \). (The distinction between uninterpreted constants and variables is made only for the purpose of presentation of sequence constraints, as will be clear shortly.) We use \(\mathcal {C}\) to denote the set of all uninterpreted T-constants. A formula \(\varphi \) is satisfiable if there is an assignment that maps the uninterpreted constants and variables to concrete values in D such that the formula becomes true in \(\mathfrak {S}\).

Next, we define how we lift T to sequence constraints, using T as the alphabet theory (a.k.a. element theory). As in the case of strings (over a finite alphabet), we use standard notation like \(D^*\) to refer to the set of all sequences over D. By default, elements of \(D^*\) are written as standard in mathematics, e.g., 7, 8, 100, when \(D = \mathbb {Z}\). Sometimes we will disambiguate them by using brackets, e.g., (7, 8, 100) or [7, 8, 100]. We will use the symbol s (with/without subscript) to refer to concrete sequences (i.e., a member of \(D^*\)). We will use \({\textbf {x}},{\textbf {y}},{\textbf {z}}\) to refer to T-sequence variables. Let \(\mathcal {V}\) denote the set of all T-sequence variables, and \(\varGamma := \mathcal {C}\cup D\). We will define constraint languages syntactically at the beginning, and will instantiate them to specific sequence operations. The theory \(T^*\) of \(T\)-sequences consists of the following constraints:

$$ \varphi :\,\!:= R({\textbf {x}}_1,\ldots ,{\textbf {x}}_r) \ |\ \varphi \wedge \varphi $$

where R is an r-ary relation symbol. In our definition of each atom R below, we will specify if an assignment \(\mu \), which maps each \({\textbf {x}}_i\) to a T-sequence and each uninterpreted constant to a T-element, satisfies R. If \(\mu \) satisfies all atoms, we say that \(\mu \) is a solution and the satisfiability problem is to decide whether there is a solution for a given \(\varphi \).

A few remarks about the missing boolean operators in the constraint language above are in order. Disjunctions can be handled easily using the DPLL(T) framework (e.g. see [32]), so we have kept our theory conjunctive. As in the case of strings, negations are usually handled separately because they can sometimes (but not in all cases) be eliminated while preserving decidability.

Equational Constraints. A T-sequence equation is of the form

$$ L = R $$

where each of L and R is a concatenation of concrete T-elements, uninterpreted constants, and T-sequence variables. That is, if \(\varTheta := \varGamma \cup \mathcal {V}\), then \(L,R \in \varTheta ^*\).

For example, in the equation

$$ 0. 1. {\textbf {x}}= {\textbf {x}}. 0.1 $$

the set of all solutions is of the form \({\textbf {x}}\mapsto (01)^*\). To make this more formal, we extend each assignment \(\mu \) to a homomorphism on \(\varTheta ^*\). We write \(\mu \models L = R\) if \(\mu (L) = \mu (R)\). Notice that this definition is just a direct extension of that of word equations (e.g. see [17]), i.e., when the domain D is finite.
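The example equation can be explored by brute force. The Python sketch below enumerates all candidate values of \({\textbf {x}}\) over the two elements 0 and 1 up to a length bound and keeps those satisfying \(0.1.{\textbf {x}}= {\textbf {x}}.0.1\); restricting to the elements 0 and 1 and the name `solutions` are simplifying assumptions made for illustration.

```python
from itertools import product


def solutions(max_len):
    """Values of x (as tuples over {0, 1}) with 0.1.x == x.0.1, up to max_len."""
    found = []
    for n in range(max_len + 1):
        for x in product((0, 1), repeat=n):
            # mu extends homomorphically: concatenate both sides and compare
            if (0, 1) + x == x + (0, 1):
                found.append(x)
    return found


small_solutions = solutions(4)
```

Up to length 4, the only solutions found are the empty sequence, `(0, 1)` and `(0, 1, 0, 1)`, in line with the form \((01)^*\).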

In most cases, inequality constraints \(L \ne R\) can be reduced to equalities; in our case this also requires element constraints, described below.

Element Constraints. We allow T-formulas to constrain the uninterpreted constants. More precisely, given a T-sentence (i.e., no free variables) \(\varphi \) that uses \(\mathcal {C}\) as uninterpreted constants, we obtain a proposition P (i.e., 0-ary relation) such that \(\mu \models P\) iff \(T \models _{\mu } \varphi \).

Negations in the equational constraints can be removed just like in the case of strings, i.e., by means of additional variables/constants and element constraints. For example, \({\textbf {x}}\ne {\textbf {y}}\) can be replaced by \(({\textbf {x}}= {\textbf {z}}a {\textbf {x}}' \wedge {\textbf {y}}= {\textbf {z}}b{\textbf {y}}' \wedge a \ne b) \vee {\textbf {x}}= {\textbf {y}}a{\textbf {z}}\vee {\textbf {x}}a {\textbf {z}}= {\textbf {y}}\). Notice that \(a \ne b\) is a T-formula because we assume the equality symbol in T.
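The three disjuncts of this encoding correspond to the three ways in which two sequences can differ: they diverge at some position after a common prefix, or one is a proper prefix of the other. The following Python sketch checks, for all short sequences over a two-element domain, that the existence of witnesses for one of the disjuncts coincides with \({\textbf {x}}\ne {\textbf {y}}\). The function name and the bounds are illustrative assumptions.

```python
from itertools import product


def differs_by_encoding(x, y):
    """True iff some disjunct of the encoding of x != y has witnesses."""
    # x = z a x', y = z b y', a != b: first position where x and y diverge
    for i in range(min(len(x), len(y))):
        if x[:i] == y[:i] and x[i] != y[i]:
            return True
    # x = y a z: y is a proper prefix of x
    if len(x) > len(y) and x[:len(y)] == y:
        return True
    # x a z = y: x is a proper prefix of y
    if len(y) > len(x) and y[:len(x)] == x:
        return True
    return False


# all sequences over {0, 1} of length at most 3
words = [w for n in range(4) for w in product((0, 1), repeat=n)]
encoding_is_exact = all(
    differs_by_encoding(x, y) == (x != y) for x in words for y in words
)
```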

Regular Constraints. Over strings, regular constraints are simply unary constraints \(U( {\textbf {x}})\), where U is an automaton. The interpretation is that \({\textbf {x}}\) is in the language of U. We define an analogue of regular constraints over sequences using parametric automata [21, 23, 24], which generalize both symbolic automata [15, 16] and variable automata [27].

A parametric automaton (PA) over T is of the form \(\mathcal {A}= (\mathcal {X},Q,\varDelta ,q_0,F)\), where \(\mathcal {X}\) is a finite set of parameters, \(Q\) is a finite set of control states, \(q_0 \in Q\) is the initial state, \(F\subseteq Q\) is the set of final states, and \(\varDelta {\subseteq _{\text {fin}}}Q\times T({\textit{curr}},\mathcal {X}) \times Q\). Here, parameters are simply uninterpreted T-constants, i.e., \(\mathcal {X}\subseteq \mathcal {C}\). Formulas that appear in transitions in \(\varDelta \) will be referred to as guards, since they restrict which transitions are enabled at a given state. Note that \({\textit{curr}}\) is an uninterpreted constant that refers to the “current” position in the sequence. The semantics is quite simply defined: a sequence \((d_1, d_2, \ldots , d_n)\) is in the language of \(\mathcal {A}\) under the assignment of parameters \(\mu \), written as \((d_1, \ldots , d_n) \in L_\mu (\mathcal {A})\), when there is a sequence of \(\varDelta \)-transitions

$$ (q_0,\varphi _1({\textit{curr}}, \mathcal {X}),q_1), (q_1,\varphi _2({\textit{curr}}, \mathcal {X}),q_2), \ldots , (q_{n-1},\varphi _n({\textit{curr}}, \mathcal {X}),q_n), $$

such that \(q_n \in F\) and \(T \models \varphi _i(d_i, \mu (\mathcal {X}))\). Finally, a regular constraint \(\mathcal {A}({\textbf {x}})\) is satisfied by \(\mu \) when \(\mu ({\textbf {x}}) \in L_\mu (\mathcal {A})\).
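The semantics can be made concrete with a small Python sketch in which guards are represented as Python predicates over the current element and the parameter assignment. This is an illustration under simplifying assumptions (guards as callables rather than T-formulas, and a fixed assignment \(\mu \) supplied by the caller); the class and field names are hypothetical. The example automaton is the one from the introduction, accepting sequences \(i^*\) for an even number i.

```python
from dataclasses import dataclass


@dataclass
class ParametricAutomaton:
    transitions: list  # triples (source, guard, target); guard(curr, params) -> bool
    initial: str
    final: set

    def accepts(self, seq, params):
        """Membership of seq in L_mu(A) for the parameter assignment params."""
        states = {self.initial}
        for d in seq:
            # follow every transition whose guard holds for the current element d
            states = {
                q2
                for (q1, guard, q2) in self.transitions
                if q1 in states and guard(d, params)
            }
        return bool(states & self.final)


# single transition q -> q guarded by  curr = p  and  curr = 0 (mod 2)
even_constant = ParametricAutomaton(
    transitions=[("q", lambda curr, mu: curr == mu["p"] and curr % 2 == 0, "q")],
    initial="q",
    final={"q"},
)
```

With `p` mapped to 4, the sequence `(4, 4, 4)` is accepted while `(4, 6)` is rejected; with `p` mapped to 3, `(3, 3)` is rejected because the guard also requires evenness.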

Note that, while it is possible to complement a PA \(\mathcal {A}\), one has to be careful with the semantics: we treat \(\mathcal {A}\) as a symbolic automaton, and symbolic automata are closed under boolean operations [15]. So we are looking for \(\mu \) such that \(\mu ({\textbf {x}}) \in \overline{L_\mu (\mathcal {A})}\). What we cannot do using complementation is a universal quantification over the parameters; note that already the theory of strings with universal and existential quantifiers is undecidable.

We state next a lemma showing that PAs using only “local” parameters, together with equational constraints, can encode the constraint language that we have defined so far.

Lemma 1

Satisfiability of sequence constraints with equation, element, and regular constraints can be reduced in polynomial-time to satisfiability of sequence constraints with equation and regular constraints (i.e., without element constraints). Furthermore, it can be assumed that no two regular constraints share any parameter.

Proposition 1

Assume that T is solvable in NP (resp. PSpace). Then, deciding nonemptiness of a parametric automaton over T is in NP (resp. PSpace).

The proof is standard (e.g. see [21, 23, 24]), and only sketched here. The algorithm first nondeterministically guesses a simple path in the automaton \(\mathcal {A}\) from an initial state \(q_0\) to some final state \(q_F\). Let us say that the guards appearing in this path are \(\psi _1({\textit{curr}},\mathcal {X}), \ldots ,\psi _k({\textit{curr}},\mathcal {X})\). We need to check if this path is realizable by checking T-satisfiability of

$$ \exists \mathcal {X}.\, \bigwedge _{i=1}^k \exists {\textit{curr}}.\,(\psi _i({\textit{curr}}, \mathcal {X})). $$

It is easy to see that this is an NP (resp. NPSPACE = PSpace) procedure.
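The structure of this check can be sketched in Python. The sketch enumerates simple paths deterministically instead of guessing one, and replaces the T-satisfiability check by a search over a finite sample of D, so it is only a stand-in for a real T-solver; the function names and the sample are assumptions. Note how the parameters are shared by all guards on the path, while a separate value of curr is chosen for each guard, mirroring the quantifier structure of the formula above.

```python
from itertools import product


def simple_paths(transitions, state, final, visited=()):
    """Yield the list of guards along each simple path from state to a final state."""
    seen = visited + (state,)
    if state in final:
        yield []
    for (src, guard, dst) in transitions:
        if src == state and dst not in seen:
            for rest in simple_paths(transitions, dst, final, seen):
                yield [guard] + rest


def nonempty(transitions, initial, final, param_names, sample):
    """Search for a realizable simple path, with values drawn from sample."""
    for guards in simple_paths(transitions, initial, final):
        for values in product(sample, repeat=len(param_names)):
            params = dict(zip(param_names, values))
            # one shared parameter assignment, an independent curr per guard
            if all(any(g(c, params) for c in sample) for g in guards):
                return True
    return False


sample = range(-3, 4)
realizable = [("q0", lambda c, m: c == m["p"] and c % 2 == 0, "q1")]
unrealizable = [("q0", lambda c, m: c < m["p"] and c > m["p"], "q1")]
```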

Parametric Transducers. We define a suitable extension of symbolic transducers over parameters following the definition from Veanes et al. [44]. A transducer constraint is of the form \({\textbf {y}}= \mathcal {T}({\textbf {x}})\), for a parametric transducer \(\mathcal {T}\). A parametric transducer over T is of the form \(\mathcal {T}= (\mathcal {X},Q,\varDelta ,q_0,F)\), where \(\mathcal {X}\), \(Q\), \(q_0\), \(F\) are just like in parametric automata. Unlike parametric automata, \(\varDelta \) is a finite set of tuples of the form \((p,(\varphi ,{\textbf {w}}),q)\), where \((p,\varphi ,q)\) is a standard transition in parametric automaton, and \({\textbf {w}}\) is a (possibly empty) sequence of T-terms over variable curr and constants \(\mathcal {X}\), e.g., \({\textbf {w}}= (curr+7,curr+2)\). One can think of \({\textbf {w}}\) as the output produced by the transition. Given an assignment \(\mu \) of parameters and the sequence variables, the constraint \({\textbf {y}}= \mathcal {T}({\textbf {x}})\) is satisfied when there is a sequence of \(\varDelta \)-transitions

$$ (q_0,(\varphi _1({\textit{curr}}, \mathcal {X}),{\textbf {w}}_1),q_1), (q_1,(\varphi _2({\textit{curr}}, \mathcal {X}),{\textbf {w}}_2),q_2), \ldots , (q_{n-1},(\varphi _n({\textit{curr}}, \mathcal {X}),{\textbf {w}}_n),q_n), $$

such that \(q_n \in F\) and \(T \models \varphi _i(d_i, \mu (\mathcal {X}))\), where \(\mu ({\textbf {x}}) = (d_1,\ldots ,d_n)\), and finally

$$ \mu ({\textbf {y}}) = \mu _1({\textbf {w}}_1) \cdots \mu _n({\textbf {w}}_n) $$

where \(\mu _i\) is \(\mu \) but maps \({\textit{curr}}\) to \(d_i\). The definition assumes that \(\mu _i\) is extended to terms and concatenation thereof by homomorphism, e.g., in LRA, if \({\textbf {w}}_1 = ({\textit{curr}}+ 7,{\textit{curr}}+ 2)\) and \(\mu _1\) maps \({\textit{curr}}\) to 10, then \({\textbf {w}}_1\) will get mapped to 17, 12. Given a set \(S \subseteq D^*\) and an assignment \(\mu \) (mapping the constants to D), we define the pre-image \(\mathcal {T}_{\mu }^{-1}(S)\) of S under \(\mathcal {T}\) with respect to \(\mu \) as the set of sequences \({\textbf {w}}\in D^*\) such that \({\textbf {w}}' = \mathcal {T}({\textbf {w}})\) holds with respect to \(\mu \) for some \({\textbf {w}}' \in S\).
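A run of a parametric transducer can be simulated along the same lines. In the Python sketch below, guards and output terms are represented as callables taking the current element and the parameter assignment; this representation and all names are illustrative assumptions. The first example reproduces the output terms \(({\textit{curr}}+ 7,{\textit{curr}}+ 2)\) from the text, and the second is a filter in the spirit of \(T_{<\textbf{l}_0}\) from Sect. 2, with the pivot passed as a parameter p.

```python
def run_transducer(transitions, initial, final, seq, params):
    """Return the set of outputs produced by accepting runs on seq."""
    configs = {(initial, ())}  # pairs (state, output produced so far)
    for d in seq:
        successors = set()
        for (state, out) in configs:
            for (src, guard, outputs, dst) in transitions:
                if src == state and guard(d, params):
                    # evaluate each output term with curr mapped to d
                    emitted = tuple(term(d, params) for term in outputs)
                    successors.add((dst, out + emitted))
        configs = successors
    return {out for (state, out) in configs if state in final}


# every element emits (curr + 7, curr + 2)
shift = [("q", lambda c, m: True, [lambda c, m: c + 7, lambda c, m: c + 2], "q")]

# keep elements below p, drop the others
keep_below = [
    ("q", lambda c, m: c < m["p"], [lambda c, m: c], "q"),
    ("q", lambda c, m: c >= m["p"], [], "q"),
]
```

Running `shift` on the one-element sequence `(10,)` yields the single output `(17, 12)`, matching the example above; running `keep_below` on `(3, 1, 4, 0)` with `p` mapped to 3 yields `(1, 0)`.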

4 Solving Equational and Regular Constraints

Here we present results on solving equational constraints, together with regular constraints, by a reduction to the string case, for which a wealth of results are already available. In general, this reduction causes an exponential blow-up in the resulting string constraint, which we show to be unavoidable in general. That said, we also provide a more refined analysis in the case when the underlying theory is LRA, where we can avoid this exponential blow-up.

Prelude: The Case of Strings. We start with some known results about the case of strings. The satisfiability of word equations with regular constraints is PSpace-complete [18, 19]. This upper bound can be extended to the full quantifier-free theory [10]. When no regular constraints are given, the problem is only known to be NP-hard, and it is widely believed to be in NP. In the absence of regular constraints, without loss of generality \(\varGamma \) can be assumed to contain only letters from the equations; this is not the case in the presence of regular constraints. The algorithm solving word equations [19] does not need explicit access to \(\varGamma \): it is enough to know whether there is a letter which labels a given set of transitions in the NFAs used in the regular constraints. In principle, there could be exponentially many different (i.e., inducing different transitions in the NFAs) letters. When oracle access to such an alphabet is provided, the satisfiability can still be decided in PSpace: while not explicitly claimed, this is exactly the scenario in [19, Sect. 5.2].

Other constraints are also considered for word equations; perhaps the most widely known are the length constraints, which are of the form: \(\sum _{x \in \mathcal V} a_x \cdot |x| \le c\), where \(\{a_x\}_{x \in \mathcal V}, c\) are integer constants and |x| denotes the length \(|\mu (x)|\), with an obvious semantics. It is an open problem, whether word equations with length constraints are decidable, see [26].

Reduction to Word Equations. We assume Lemma 1, i.e. that the parameters used for different automata-based constraints are pairwise different. In particular, when looking for a satisfying assignment \(\mu \) we can first fix an assignment for \(\mathcal {X}\) and then try to extend it to \(\mathcal V\). To avoid confusion, we call this partial assignment \(\pi : \mathcal {X}\rightarrow D\).

Consider the set \(\varPhi \) of all atoms in all guards in the regular constraints, together with the set of formulas \(\{x = c\}\) over all constants \(c \in D\) that appear in the equational constraints, and the negations of both types of formulas. Fix an assignment \(\pi : \mathcal {X}\rightarrow D\). The type \({{\,\textrm{type}\,}}_\pi (a)\) of a (under assignment \(\pi \)) is the set of formulas in \(\varPhi \) satisfied by a, i.e. \(\{ \varphi \in \varPhi \, : \, \varphi (\pi (\mathcal {X}), a) \text { holds}\}\). Clearly there are at most exponentially many different types (for a fixed \(\pi \)). A type t is realizable (for \(\pi \)) when \(t = {{\,\textrm{type}\,}}_\pi (a)\) for some \(a \in D\), in which case we say that t is realized by a.

If the constraints are satisfiable (for some parameter assignment \(\pi \)) then they are satisfiable over a subset \(D_\pi {\subseteq _{\text {fin}}}D\), in the sense that we assign to uninterpreted constants elements from \(D_\pi \) and to T-sequence variables elements of \(D_\pi ^*\), where \(D_\pi \) is created by taking (arbitrarily) one element of each realizable type. Note that for each constant c in the equational constraints there is a formula “\(x = c\)” in \(\varPhi \); in particular, \({{\,\textrm{type}\,}}_\pi (c)\) is realizable (only by c) and so \(c \in D_\pi \).
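The notions of type and of the finite domain \(D_\pi \) can be illustrated with a short Python sketch. Atoms are represented as named predicates, a type is the set of names of the atoms satisfied by an element (the negated atoms are then implicit), and realizable types are discovered by scanning a finite sample of integers. The sample is a stand-in assumption: in general, realizability of a type is a T-satisfiability question. All names below are hypothetical.

```python
def type_of(a, atoms, pi):
    """The set of (names of) atoms of Phi satisfied by a under the assignment pi."""
    return frozenset(name for name, phi in atoms.items() if phi(a, pi))


def representatives(sample, atoms, pi):
    """Pick one element per type realized in sample; the values form D_pi."""
    reps = {}
    for a in sample:
        reps.setdefault(type_of(a, atoms, pi), a)
    return reps


atoms = {
    "curr = p": lambda c, pi: c == pi["p"],
    "curr even": lambda c, pi: c % 2 == 0,
    "curr = 0": lambda c, pi: c == 0,  # from a constant 0 in an equation
}
pi = {"p": 4}
reps = representatives(range(-5, 6), atoms, pi)


def psi(a):
    """Letter-to-letter map sending a to the representative of its type."""
    return reps[type_of(a, atoms, pi)]
```

In this example four types are realized (odd numbers, even numbers other than 0 and 4, the constant 0, and the parameter value 4), and the constant 0 is necessarily its own representative.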

Lemma 2

Given a system of constraints and a parameter assignment \(\pi \) let \(D_\pi \subseteq D\) be obtained by choosing (arbitrarily) for each realizable type a single element of this type. Then the set of constraints is satisfiable (for \(\pi \)) over D if and only if they are satisfiable (for \(\pi \)) over \(D_\pi \). To be more precise, there is a letter-to-letter homomorphism \(\psi : D^* \rightarrow D_\pi ^*\) such that if \(\mu \) is a solution of a system of constraints then \(\psi \circ \mu \) is also a solution.

The proof can be found in the full version; its intuition is clear: we map each letter \(a \in D\) to the unique letter in \(D_\pi \) of the same type.

Once the assignment is fixed (to \(\pi \)) and the domain is restricted to a finite set (\(D_\pi \)), the equational and regular constraints reduce to word equations with regular constraints. We treat \(D_\pi \) as a finite alphabet, and for a parametric automaton \(\mathcal {A}= (\mathcal {X},Q,\varDelta ,q_0,F)\) we create an NFA \(\mathcal {A}' = (D_\pi ,Q,\varDelta ',q_0,F)\), i.e., an automaton over the alphabet \(D_\pi \) with the same set of states Q, the same starting state \(q_0\) and accepting states F, and the transition relation defined by \((q,a,q') \in \varDelta '\) if and only if there is \((q,\varphi ({\textit{curr}},\mathcal {X}),q') \in \varDelta \) such that \(\varphi (a,\pi (\mathcal {X}))\) holds. That is, we can move from q to \(q'\) by a in \(\mathcal {A}'\) if and only if we can make this move in \(\mathcal {A}\) under assignment \(\pi \). The following is immediate from the construction.

Lemma 3

Given an assignment of parameters \(\pi \) let \(D_\pi \) be a set from Lemma 2, \(\mathcal {A}\) be a parametric automaton and \(\mathcal {A}'\) the automaton as constructed above. Then

$$ L_\pi (\mathcal {A}) \cap D_\pi ^* = L(\mathcal {A}') . $$
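The construction of \(\mathcal {A}'\) and the statement of the lemma can be checked on a small instance. The Python sketch below instantiates the guards of a parametric automaton over a finite alphabet and compares the two notions of acceptance on all words up to a length bound; guards as callables, the concrete automaton, and the alphabet `(4, 0, 1)` (one representative per type for the atoms "curr = p" and "curr even" with p mapped to 4) are assumptions made for illustration.

```python
from itertools import product


def to_nfa(pa_transitions, alphabet, pi):
    """Delta': (q, a, q') whenever some guard from q to q' holds for a under pi."""
    return {
        (q, a, q2)
        for (q, guard, q2) in pa_transitions
        for a in alphabet
        if guard(a, pi)
    }


def nfa_accepts(delta, initial, final, word):
    states = {initial}
    for a in word:
        states = {q2 for (q, b, q2) in delta if q in states and b == a}
    return bool(states & final)


def pa_accepts(pa_transitions, initial, final, word, pi):
    states = {initial}
    for d in word:
        states = {q2 for (q, g, q2) in pa_transitions if q in states and g(d, pi)}
    return bool(states & final)


# first element equals p, all later elements are even
pa = [
    ("q0", lambda c, m: c == m["p"], "q1"),
    ("q1", lambda c, m: c % 2 == 0, "q1"),
]
pi = {"p": 4}
d_pi = (4, 0, 1)
delta = to_nfa(pa, d_pi, pi)

languages_agree = all(
    nfa_accepts(delta, "q0", {"q1"}, w) == pa_accepts(pa, "q0", {"q1"}, w, pi)
    for n in range(4)
    for w in product(d_pi, repeat=n)
)
```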

We can replace the parametric-automata constraints with regular constraints and treat equational constraints as word equations (over the finite alphabet \(D_\pi \)). From Lemma 2 and Lemma 3 it follows that the original constraints have a solution for assignment \(\pi \) if and only if the constructed system of constraints has a solution. Therefore, once the appropriate assignment \(\pi \) is fixed, the satisfiability of the constraints can be checked [19]. It turns out that we do not need the actual \(\pi \): it is enough to know which types are realizable for it, which translates to an exponential-size formula. We will use the letter \(\tau \) to denote a set of subsets of \(\varPhi \); the idea is that \(\tau = \{{{\,\textrm{type}\,}}_\pi (a) \, : \, a \in D\} \subseteq 2^\varPhi \) and if different \(\pi , \pi '\) give the same sets of realizable types, then either both yield a satisfying assignment or neither does. Hence it is enough to focus on \(\tau \) and not on the actual \(\pi \).

Lemma 4

Given a system of equational and regular constraints we can non-deterministically reduce them to a formula of a form

$$\begin{aligned} \exists _{t \in \tau } a_t \in D .\, \exists \mathcal {X}\in D^+ .\, \bigwedge _{t \in \tau } \bigwedge _{\varphi \in t} \varphi (\mathcal {X}, a_t) , \end{aligned} \tag{1}$$

where \(\tau \subseteq 2^\varPhi \) is of at most exponential size, and a system of word equations with regular constraints of linear size and over a \(|\tau |\)-size alphabet, using auxiliary \(\mathcal O(n |\tau |)\) space. The solutions of the latter word equations (for which also (1) holds) are solutions of the original system, by appropriate identifications of symbols.

Proof

We guess the set \(\tau \) of types of the assignment of parameters \(\pi \), i.e. \(\tau = \{{{\,\textrm{type}\,}}_\pi (a) \, : \, a \in D\}\) such that there is an assignment \(\mu \) extending \(\pi \); note that as \(\varPhi \) has linearly many atoms and \(\tau \subseteq 2^\varPhi \), then \(|\tau |\) may be of exponential size, in general. Formula (1) verifies the guess: we validate whether there are values of \(\mathcal {X}\) such that for each type \(t \in \tau \) there is a value a such that \({{\,\textrm{type}\,}}_\pi (a) = t\).

Let \(D_\pi \) be a set having one symbol for every type in \(\tau \), as in Lemma 2; note that this includes all constants in the equational constraints. The algorithm will not have access to particular values; instead we store each \(t \in \tau \), say as a bitvector describing which atoms in \(\varPhi \) this letter satisfies. In particular, \(|D_\pi | = |\tau |\) and it is at most exponential. In the following we will consider only solutions over \(D_\pi \).

For each \(a \in D_\pi \) we can validate which transitions in \(\mathcal {A}\) it can take: the transition is labelled by a guard which is a conjunction of atoms from \(\varPhi \) and either each such atom is in \({{\,\textrm{type}\,}}_\pi (a)\) or not. Hence we can treat \(\mathcal {A}\) as an NFA for \(D_\pi \). We need neither construct nor store this NFA, as we can use \(\mathcal {A}\) directly: when we want to make a transition by \(\varphi (\mathcal {X},a)\) we look up whether each atom of \(\varphi \) is in \({{\,\textrm{type}\,}}_\pi (a)\) or not. Similarly, the constraint \(\mathcal {A}({\textbf {x}})\) is restricted to \({\textbf {x}}\in L_\pi (\mathcal {A})\) and for \({\textbf {x}}\in D_\pi ^*\) this is a usual regular constraint.

We treat equational constraints as word equations over alphabet \(D_\pi \).

Concerning the correctness of the reduction: if the system of word equations (with regular constraints) is satisfiable and the formula (1) is also satisfiable, then there is a satisfying assignment \(\mu \) over \(D_\pi \) and \(D_\pi ^*\). In particular, there is an assignment of parameters for which there are letters of the given types. Note that in principle \(\mu \) could induce more types, i.e. there could be a value a such that \({{\,\textrm{type}\,}}_\mu (a) \notin \tau \), which is then not represented in \(D_\pi \); this is fine, as enlarging the alphabet cannot invalidate a solution. Consequently, the transitions for \(a_t\) in the automata after the reduction are the same as in the corresponding parametric automata for the assignment \(\pi \); this is guaranteed by the satisfiability of (1) and the way we construct the instance, see Lemma 3.

On the other hand, when there is a solution of the input constraints, there is one for some assignment of parameters \(\pi \). Hence, by Lemma 2, there is a solution over \(D_\pi \). The algorithm guesses \(\tau = \{{{\,\textrm{type}\,}}_\pi (a) \, : \, a\in D\}\) and (1) is true for it. Then by Lemma 2 there is a solution over \(D_\pi \) as constructed in the reduction and by Lemma 3 the regular constraints define the same subsets of \(D_\pi ^*\) both when interpreted as parametric automata and NFAs.    \(\square \)
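The bookkeeping used in this proof (types stored as bitvectors, and guards evaluated by looking atoms up in a type) can be illustrated by a minimal Python sketch. The atom set `PHI`, the parameter name `k`, the finite `sample_domain` and the automaton `CONTAINS_K` are hypothetical choices made for illustration only; the actual procedure guesses \(\tau \) and validates it through formula (1) rather than enumerating a domain.

```python
# Hypothetical atom set Phi over (curr, params); chosen for illustration only.
PHI = [
    lambda curr, p: curr == p["k"],  # atom 0: curr = k
    lambda curr, p: curr < p["k"],   # atom 1: curr < k
]

def type_of(a, params):
    """Bitvector recording which atoms of PHI the value a satisfies."""
    return tuple(atom(a, params) for atom in PHI)

def realizable_types(sample_domain, params):
    """Set of types realized by the values of a (finite, assumed) sample domain."""
    return {type_of(a, params) for a in sample_domain}

def guard_holds(guard, t):
    """A guard is a list of (atom_index, polarity) literals; t is a type."""
    return all(t[i] == polarity for i, polarity in guard)

def accepts(transitions, start, finals, word, params):
    """Run the parametric automaton as an NFA over the types of the letters."""
    current = {start}
    for a in word:
        t = type_of(a, params)
        current = {q2 for (q1, guard, q2) in transitions
                   if q1 in current and guard_holds(guard, t)}
    return bool(current & finals)

# Hypothetical automaton accepting the words that contain a letter equal to k.
CONTAINS_K = [
    (0, [(0, False)], 0),  # curr != k: stay in state 0
    (0, [(0, True)], 1),   # curr = k: move to the accepting state
    (1, [], 1),            # empty guard: any letter
]
```

Only the type of a letter is ever consulted when taking a transition, which is why one symbol per type suffices as the alphabet \(D_\pi \).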

Theorem 1

If theory T is in PSpace then sequence constraints are in ExpSpace.

If \(\tau \) is polynomial size and the formula (1) can be verified in PSpace, then sequence constraints can be verified in PSpace.

One of the difficulties in deciding sequence constraints using the word equations approach is the size of the set of realizable types \(\tau \), which could be exponential. For some concrete theories it is known to be smaller and thus a better upper bound on complexity follows. For instance, it is easy to show that for LRA there are linearly many realizable types, which implies a PSpace upper bound.

Corollary 1

Sequence constraints for Linear Real Arithmetic are in PSpace.
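The counting argument behind the linear bound can be sketched as follows, under the assumption (not spelled out in the text) that once the parameters are fixed, every atom compares \({\textit{curr}}\) with a single constant. Then n thresholds split the real line into at most 2n+1 regions (the n points and the n+1 open intervals around them), and the type is constant on each region. The helper names below are hypothetical.

```python
from fractions import Fraction

def sample_points(thresholds):
    """One representative per region of the real line cut by the thresholds."""
    cs = sorted(set(Fraction(c) for c in thresholds))
    if not cs:
        return [Fraction(0)]
    points = [cs[0] - 1]                    # region below the smallest threshold
    for lo, hi in zip(cs, cs[1:]):
        points += [lo, (lo + hi) / 2]       # a threshold, then the gap after it
    points += [cs[-1], cs[-1] + 1]          # largest threshold and region above
    return points

def realizable_types(atoms, thresholds):
    """Types realized over the reals; atoms are predicates on curr."""
    return {tuple(atom(x) for atom in atoms) for x in sample_points(thresholds)}

# Hypothetical atoms after fixing the parameters: three threshold comparisons.
ATOMS = [lambda x: x <= 1, lambda x: x < 3, lambda x: x == 4]
THRESHOLDS = [1, 3, 4]
```

For these three atoms only four of the eight conceivable types are realizable, well within the bound of 2n+1 = 7.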

In general, the ExpSpace upper bound from Theorem 1 cannot be improved, as even non-emptiness of intersection of parametric automata is ExpSpace-complete for some theories decidable in PSpace. This is in contrast to the case of symbolic automata, for which the non-emptiness of intersection (for a theory \(T\) decidable in PSpace) is in PSpace. This shows the importance of parameters in our lower bound proof.

Theorem 2

There are theories with existential fragment decidable in PSpace and whose non-emptiness of intersection of parametric automata is ExpSpace-complete.

When no regular constraints are allowed, we can solve the equational and element constraints in PSpace (note that we do not use Lemma 1).

Theorem 3

For a theory \(T\) decidable in PSpace, the element and equational constraints (so no regular constraints) can be decided in PSpace.

5 Algorithm for Straight-Line Formulas

It is known that adding finite transducers into word equations results in an undecidable model (e.g. see [35]). Therefore, we extend the straight-line restriction [12, 35] to sequences, and show that it suffices to recover decidability for equational constraints, together with regular and transducer constraints. In fact, we will show that satisfiability in the straight-line fragment is decidable in doubly exponential time and is ExpSpace-hard, if T is solvable in PSpace. It has been observed that the straight-line fragment for the theory of strings already covers many interesting benchmarks [12, 35], and similarly many properties of sequence-manipulating programs can be proven using the fragment, including the QuickSort example from Sect. 2 and other benchmarks shown in Sect. 7.

The Straight-Line Fragment SL. We start by defining recognizable formulas over sequences, followed by the syntactic and semantic restrictions on our constraint language. This definition follows closely the definition of recognizable relations over finite alphabets, except that we replace finite automata with parametric automata.

Definition 1 (Recognizable formula)

A formula \(R({\textbf {x}}_1,\ldots ,{\textbf {x}}_r)\) is recognizable if it is equivalent to a positive Boolean combination of regular constraints.

Note that this is simply a generalization of regular constraints to multiple variables: a 1-ary recognizable formula can be turned into a single regular constraint, since regular constraints are closed under intersection and union.

To define the straight-line fragment, we use the approach of [12]; that is, the fragment is defined in terms of “feasibility of a symbolic execution”. Here, a symbolic execution is just a sequence of assignments and assertions, whereas the feasibility problem amounts to deciding whether there are concrete values of the variables so that the symbolic execution can be run and none of the assertions are violated. We now make this intuition formal. A symbolic execution is syntactically generated by the following grammar:

$$\begin{aligned}&S ~~:\,\!:=~~ {\textbf {y}}:=f({\textbf {x}}_1,\dots , {\textbf {x}}_k, \mathcal {X}) \;|\; {\textbf {assert}}(R({\textbf {x}}_1,\dots ,{\textbf {x}}_r)) \;|\; {\textbf {assert}}(\varphi ) \;|\;S;S&\end{aligned}$$
(2)

where \(f: (D^*)^k \times D^{|\mathcal {X}|} \rightarrow D^*\) is a function, R are recognizable formulas, and \(\varphi \) are element constraints.

The symbolic execution S can be turned into a sequence constraint as follows. Firstly, we can turn S into the standard Static Single Assignment (SSA) form by means of introducing new variables on the left-hand side of an assignment. For example, \({\textbf {y}}:= f({\textbf {x}}); {\textbf {y}}:= g({\textbf {z}})\) becomes \({\textbf {y}}:= f({\textbf {x}}); {\textbf {y}}' := g({\textbf {z}})\). Then, in the resulting constraint, each variable appears at most once on the left-hand side of an assignment. That way, we can simply replace each assignment symbol \(:=\) with an equality symbol \(=\). We then treat each sequential composition as the conjunction operator \(\wedge \) and each assertion as a conjunct. Note that individual assertions are already sequence constraints. Next, we define how an interpretation \(\mu \) satisfies the constraint \({\textbf {y}}= f({\textbf {x}}_1,\ldots ,{\textbf {x}}_r,\mathcal {X})\):

$$ \mu \models {\textbf {y}}= f({\textbf {x}}_1,\ldots ,{\textbf {x}}_r,\mathcal {X}) \quad \text {iff} \quad \mu ({\textbf {y}}) = f(\mu ({\textbf {x}}_1),\ldots ,\mu ({\textbf {x}}_r),\mu (\mathcal {X})). $$

Note that ’=’ on the l.h.s. is syntactic, while the ’=’ on the r.h.s. is in the metalanguage. The definition of the semantics of the language is now inherited from Sect. 3.
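The translation from a symbolic execution to a conjunction of constraints can be sketched in a few lines of Python. The tuple encoding of statements and the priming scheme for fresh names are assumptions made for this illustration; any renaming that yields fresh left-hand sides would do.

```python
def to_ssa(program):
    """Rename assigned variables so each appears at most once on a left-hand side.

    Statements are ("assign", y, f, args) or ("assert", r, args); this tuple
    encoding is a hypothetical choice for the sketch.
    """
    current = {}   # source variable -> its current SSA name
    used = set()   # SSA names that have already been read or written
    out = []

    def read(v):
        name = current.get(v, v)
        used.add(name)
        return name

    def fresh(v):
        name = v
        while name in used:
            name += "'"
        used.add(name)
        current[v] = name
        return name

    for stmt in program:
        if stmt[0] == "assign":
            _, y, f, args = stmt
            new_args = tuple(read(a) for a in args)  # read before renaming y
            out.append(("assign", fresh(y), f, new_args))
        else:
            _, r, args = stmt
            out.append(("assert", r, tuple(read(a) for a in args)))
    return out

def to_conjuncts(ssa):
    """Replace ':=' by '=' and read ';' as conjunction: one string per conjunct."""
    conjuncts = []
    for stmt in ssa:
        if stmt[0] == "assign":
            _, y, f, args = stmt
            conjuncts.append(f"{y} = {f}({', '.join(args)})")
        else:
            _, r, args = stmt
            conjuncts.append(f"{r}({', '.join(args)})")
    return conjuncts
```

Applied to the two assignments from the example above followed by an assertion on \({\textbf {y}}\), the second assignment and the assertion both refer to the fresh variable \({\textbf {y}}'\).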

In addition to the syntactic restrictions, we also need a semantic condition: in our language, we only permit functions f such that the pre-image of each regular constraint under f is effectively a recognizable formula:

  • (RegInvRel) A function f is permitted if for each regular constraint \(\mathcal {A}( {\textbf {y}})\), it is possible to compute a recognizable formula that is equivalent to the formula \(\exists {\textbf {y}}: \mathcal {A}({\textbf {y}}) \wedge {\textbf {y}}= f({\textbf {x}}_1,\ldots ,{\textbf {x}}_r,\mathcal {X})\).

Two functions satisfying (RegInvRel) are the concatenation function \({\textbf {x}}:= {\textbf {y}}.{\textbf {z}}\) (here \({\textbf {y}}\) could be the same as \({\textbf {z}}\)) and parametric transducers \({\textbf {y}}:= \mathcal {T}({\textbf {x}})\). We will only use these two functions in the paper, but the result is generalizable to other functions.

Proposition 2

Given a regular constraint \(\mathcal {A}( {\textbf {y}})\) and a constraint \({\textbf {y}}= {\textbf {x}}.{\textbf {z}}\), we can compute a recognizable formula \(\psi ({\textbf {x}},{\textbf {z}})\) equivalent to \(\exists {\textbf {y}}: \mathcal {A}({\textbf {y}}) \wedge {\textbf {y}}= {\textbf {x}}.{\textbf {z}}\). Furthermore, this can be achieved in polynomial time.

The proof of this proposition is exactly the same as in the case of strings, e.g., see [12, 35].
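For intuition, the string-case construction can be sketched over a finite alphabet, where each letter stands for one type in \(D_\pi \): an accepting run on \({\textbf {x}}.{\textbf {z}}\) is split at the state q reached after reading \({\textbf {x}}\), which gives one disjunct per state. The encoding of automata as sets of triples and the function names are assumptions of this sketch.

```python
from itertools import product

def accepts(trans, start, finals, word):
    """Standard NFA run; trans is a set of (state, letter, state) triples."""
    current = {start}
    for a in word:
        current = {q2 for (q1, b, q2) in trans if q1 in current and b == a}
    return bool(current & set(finals))

def concat_preimage(states, start, finals):
    """Disjuncts ((start_x, finals_x), (start_z, finals_z)), one per split state q."""
    return [((start, {q}), (q, set(finals))) for q in states]

def preimage_holds(trans, disjuncts, x, z):
    """Evaluate the recognizable formula psi(x, z) given as a list of disjuncts."""
    return any(accepts(trans, sx, fx, x) and accepts(trans, sz, fz, z)
               for (sx, fx), (sz, fz) in disjuncts)

def all_words(alphabet, max_len):
    """All words over the alphabet up to the given length (for checking only)."""
    return [w for n in range(max_len + 1) for w in product(alphabet, repeat=n)]

# Hypothetical automaton: words containing the letter "k" ("o" is any other type).
STATES = [0, 1]
TRANS = {(0, "o", 0), (0, "k", 1), (1, "o", 1), (1, "k", 1)}
PSI = concat_preimage(STATES, 0, {1})
```

The number of disjuncts equals the number of states, which is consistent with the polynomial bound stated in Proposition 2.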

Proposition 3

Given a regular constraint \(\mathcal {A}( {\textbf {y}})\) and a parametric transducer constraint \({\textbf {y}}= \mathcal {T}({\textbf {x}})\), we can compute a regular constraint \(\mathcal {A}'( {\textbf {x}})\) that is equivalent to \(\exists {\textbf {y}}: \mathcal {A}({\textbf {y}}) \wedge {\textbf {y}}= \mathcal {T}({\textbf {x}})\). This can be achieved in exponential time.

The construction in Proposition 3 is essentially the same as the pre-image computation of a symbolic automaton under a symbolic transducer [44]. The complexity is exponential in the maximum number of output symbols of a single transition (i.e. the maximum length of \({\textbf {w}}\) in the transducer), which is in practice a small natural number.
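A finite-alphabet sketch of such a pre-image computation is given below: the pre-image automaton tracks a transducer state together with an automaton state, and an input letter a moves the automaton component along the whole output word \({\textbf {w}}\) of the chosen transducer transition. This simplification drops the symbolic guards of parametric transducers, so it does not exhibit the exponential cost mentioned above; all names and the example transducer are hypothetical.

```python
from itertools import product

def run_from(trans, q, word):
    """States reachable from q by reading word in an NFA given as triples."""
    current = {q}
    for a in word:
        current = {q2 for (q1, b, q2) in trans if q1 in current and b == a}
    return current

def accepts(trans, start, finals, word):
    return bool(run_from(trans, start, word) & set(finals))

def transducer_preimage(a_trans, a_states, a_start, a_finals,
                        t_trans, t_start, t_finals):
    """NFA for { x : some output y of the transducer on x is accepted by A }.

    Transducer transitions are (p, input_letter, output_word, p2).
    """
    trans = set()
    for (p, a, w, p2) in t_trans:
        for q in a_states:
            for q2 in run_from(a_trans, q, w):
                trans.add(((p, q), a, (p2, q2)))
    start = (t_start, a_start)
    finals = {(p, q) for p in t_finals for q in a_finals}
    return trans, start, finals

def all_words(alphabet, max_len):
    return [w for n in range(max_len + 1) for w in product(alphabet, repeat=n)]

# A: words containing "k". T (hypothetical): rewrites "o" to "ko" and deletes "k".
A_STATES = [0, 1]
A_TRANS = {(0, "o", 0), (0, "k", 1), (1, "o", 1), (1, "k", 1)}
T_TRANS = {("t", "o", ("k", "o"), "t"), ("t", "k", (), "t")}
PRE = transducer_preimage(A_TRANS, A_STATES, 0, {1}, T_TRANS, "t", {"t"})
```

For this transducer the output contains "k" exactly when the input contains "o", so the computed automaton accepts exactly the inputs with at least one "o".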

The following is our main theorem on the SL fragment with equational constraints, regular constraints, and transducers.

Theorem 4

If T is solvable in PSpace, then the SL fragment with concatenation and parametric transducers over T is in 2-ExpTime and is ExpSpace-hard.

Proof

We give a decision procedure. We assume that S is already in SSA (i.e. each variable appears at most once on the left-hand side). Let us assume that S is of the form \(S';{\textbf {y}}:= f({\textbf {x}}_1,\ldots ,{\textbf {x}}_r)\), for some symbolic execution \(S'\). Without loss of generality, we may assume that each recognizable constraint is of the form \(\mathcal {A}( {\textbf {x}})\). This is no limitation: (1) since each R in the assertion is a recognizable formula, we simply have to “guess” one of the implicants for each R, and (2) \({\textbf {assert}}(\psi _1 \wedge \psi _2)\) is equivalent to \({\textbf {assert}}(\psi _1); {\textbf {assert}}(\psi _2)\).

Assume now that \(\{\mathcal {A}_1({\textbf {y}}),\ldots ,\mathcal {A}_m({\textbf {y}})\}\) are all the regular constraints on \({\textbf {y}}\) in S. By our assumption, it is possible to compute a recognizable formula equivalent to

$$ \psi ({\textbf {x}}_1,\ldots ,{\textbf {x}}_r) := \exists {\textbf {y}}: \bigwedge _{i=1}^m \mathcal {A}_i({\textbf {y}}) \wedge {\textbf {y}}= f({\textbf {x}}_1,\ldots ,{\textbf {x}}_r). $$

There are two ways to see this. The first way is that regular constraints are closed under intersection. This is in general computationally quite expensive because it requires a product automaton construction before applying the pre-image computation. A better way to do this is to observe that \(\psi \) is equivalent to the conjunction of \(\psi _i\)’s over \(i=1,\ldots ,m\), where

$$ \psi _i := \exists {\textbf {y}}: \mathcal {A}_i({\textbf {y}}) \wedge {\textbf {y}}= f({\textbf {x}}_1,\ldots ,{\textbf {x}}_r). $$

By our semantic condition, we can compute recognizable formulas \(\psi _1',\ldots ,\psi _m'\) equivalent to \(\psi _1,\ldots ,\psi _m\) respectively. Therefore, we simply replace S by

$$ S';{\textbf {assert}}(\psi _1');\cdots ;{\textbf {assert}}(\psi _m'), $$

in which every occurrence of \({\textbf {y}}\) has been completely eliminated. Applying the above variable elimination iteratively, we end up with a conjunction of regular constraints and element constraints, which as we saw in Sect. 4 is decidable.    \(\square \)

Fig. 1. \(\mathcal {A}_0\) accepts all words not containing k and \(\mathcal {A}_1\) accepts all words containing k.

Example 1

We consider the example from Sect. 2 where a weaker form of the permutation property is shown for QuickSort. The formula that has to be proven is a disjunction of straight-line formulas and in the following we execute our procedure only on one disjunct without redundant formulas:

$${\textbf {assert}}(\mathcal {A}_0(\textbf{left}'));{\textbf {assert}}(\mathcal {A}_0(\textbf{right}')); \textbf{res} = \textbf{left}' \,.\, [\textbf{l}_0] \,.\, \textbf{right}'; {\textbf {assert}}(\mathcal {A}_1(\textbf{res})) $$

We model \(L(\mathcal {A}_1)\) as the language which accepts all words which contain one letter equal to k and \(L(\mathcal {A}_0)\) as the language which accepts only words not containing k, where k is an uninterpreted constant, so a single element. See Fig. 1. We begin by removing the operation \(\textbf{res} = \textbf{left}' \,.\, [\textbf{l}_0] \,.\, \textbf{right}'\). The product automaton for all assertions that contain \(\textbf{res}\) is just \(\mathcal {A}_1\). Hence, we can remove the assertion \({\textbf {assert}}( \mathcal {A}_1(\textbf{res}))\). The concatenation function . satisfies RegInvRel and the pre-image g can be represented by

$$ \bigvee _{0\le i,j\le 1} \mathcal {A}_1^{q_0,\{q_i\}}(\mathbf {left'}) \wedge \mathcal {A}_1^{q_i, \{q_j\}}([\textbf{l}_0]) \wedge \mathcal {A}_1^{q_j, \{q_1\}}(\textbf{right}'), $$

where \(\mathcal {A}_i^{p,F'}\) is \(\mathcal {A}_i\) with start state set to p and finals to \(F'\).

In the next step, the assertion g is added to the program and all assertions containing \(\textbf{res}\) and the concatenation function are removed.

$$\begin{aligned}\begin{gathered} {\textbf {assert}}(\mathcal {A}_0(\textbf{left}'));{\textbf {assert}}(\mathcal {A}_0(\textbf{right}'));{\textbf {assert}}(g(\textbf{left}',[\textbf{l}_0], \textbf{right}')) \end{gathered}\end{aligned}$$

From here, we pick a tuple from g, let's say \(i = j = 1\), and obtain

$$\begin{aligned}\begin{gathered} {\textbf {assert}}(\mathcal {A}_0(\textbf{left}'));{\textbf {assert}}(\mathcal {A}_0(\textbf{right}')); {\textbf {assert}}(\textbf{left}' \in \mathcal {A}_1^{q_0,\{q_1\}});\\ {\textbf {assert}}([\textbf{l}_0] \in \mathcal {A}_1^{q_1,\{q_1\}});{\textbf {assert}}(\textbf{right}' \in \mathcal {A}_1^{q_1,\{q_1\}}) \end{gathered}\end{aligned}$$

Finally, the product automata \(\mathcal {A}_0 \times \mathcal {A}_1^{q_0,\{q_1\}}\) and \(\mathcal {A}_0 \times \mathcal {A}_1^{q_1,\{q_1\}}\) are computed for the variables \(\textbf{left}'\) and \(\textbf{right}'\), respectively, and a non-emptiness check over the product automata and the automaton for \([\textbf{l}_0]\) is done. The procedure will find no satisfiable combination of paths through these automata, since \(\mathcal {A}_0\) forbids \(\textbf{left}'\) from containing k, whereas \(\mathcal {A}_1^{q_0,\{q_1\}}\) accepts only after reading a k. Next, the procedure needs to exhaust all remaining tuples from \((\mathcal {A}_1^{q_0,\{q_i\}}, \mathcal {A}_1^{q_i, \{q_j\}}, \mathcal {A}_1^{q_j, \{q_1\}})_{0 \le i,j \le 1}\) before it is proven that this disjunct is unsatisfiable.
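The enumeration of tuples in this example can be replayed with a small Python sketch over a two-letter alphabet, where "k" stands for the type "equal to k" and "o" for every other value. The state names and the encoding are assumptions of the sketch. It checks, for each pair (i, j), whether the two product automata are non-empty and which letters \([\textbf{l}_0]\) may read.

```python
from collections import deque

# A0: words without "k"; A1: words containing "k" (states named as in Fig. 1).
A0 = {("p0", "o", "p0")}
A1 = {("q0", "o", "q0"), ("q0", "k", "q1"), ("q1", "o", "q1"), ("q1", "k", "q1")}

def product_nonempty(t1, s1, f1, t2, s2, f2):
    """Breadth-first search for an accepting state of the product automaton."""
    seen = {(s1, s2)}
    queue = deque(seen)
    while queue:
        a, b = queue.popleft()
        if a in f1 and b in f2:
            return True
        for (x, l1, x2) in t1:
            for (y, l2, y2) in t2:
                if x == a and y == b and l1 == l2 and (x2, y2) not in seen:
                    seen.add((x2, y2))
                    queue.append((x2, y2))
    return False

def letters(trans, src, dst):
    """Letters on which a single transition leads from src to dst."""
    return {l for (x, l, y) in trans if x == src and y == dst}

def surviving_tuples():
    """Tuples (i, j) not refuted by the automata, with the letters allowed for l_0."""
    result = {}
    for i in (0, 1):
        for j in (0, 1):
            qi, qj = f"q{i}", f"q{j}"
            left_ok = product_nonempty(A0, "p0", {"p0"}, A1, "q0", {qi})
            right_ok = product_nonempty(A0, "p0", {"p0"}, A1, qj, {"q1"})
            pivot = letters(A1, qi, qj)
            if left_ok and right_ok and pivot:
                result[(i, j)] = pivot
    return result
```

In this sketch three of the four tuples are refuted by the automata alone, and the remaining one, i = 0 and j = 1, forces \(\textbf{l}_0\) to equal k. Refuting it therefore needs an element constraint such as \(\textbf{l}_0 \ne k\), which is not part of the displayed disjunct and is treated here as an assumption about the omitted context.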

6 Extensions and Undecidability

Length Constraints. We consider the extension of our model by allowing length constraints on the sequence variables: for each sequence variable \({\textbf {x}}\) we consider the associated length variable \(\ell _{\textbf {x}}\), and let \(\mathcal L =\{ \ell _{\textbf {x}}\, : \, {\textbf {x}}\in \mathcal V\}\) be the set of length variables. We extend \(\mu \) to \(\mathcal L\) so that it assigns natural numbers to the length variables. The length constraints are of the form \(\sum _{{\textbf {x}}} a_{\textbf {x}}\ell _{\textbf {x}}? 0\), where \(? \in \{<, \le , =, \ne , \ge , >\}\) and each \(a_{\textbf {x}}\) is an integer constant, i.e., linear arithmetic formulas on the length variables. The semantics is natural: we require that \(|\mu ({\textbf {x}})| = \mu (\ell _{\textbf {x}})\) (the assigned values are the true lengths of sequences) and that \(\mu (\mathcal L)\) satisfies each length constraint.
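The semantics of a single length constraint can be made concrete with a short Python check. The dictionary encoding of \(\mu \) and of the coefficients, and the operator spellings, are assumptions of this sketch; the length variable \(\ell _{\textbf {x}}\) is not stored separately but read off as the length of \(\mu ({\textbf {x}})\).

```python
import operator

# Comparison symbols allowed in length constraints (ASCII spellings assumed).
OPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}

def satisfies_length_constraint(mu, coeffs, op):
    """Check sum_x a_x * |mu(x)| op 0 for an assignment mu of sequences.

    mu maps sequence variable names to lists; coeffs maps names to integers.
    """
    total = sum(a * len(mu[x]) for x, a in coeffs.items())
    return OPS[op](total, 0)
```

For instance, with \(\mu ({\textbf {x}})\) of length 3 and \(\mu ({\textbf {y}})\) of length 1, the constraint \(\ell _{\textbf {x}} - 3\ell _{\textbf {y}} = 0\) holds while \(\ell _{\textbf {x}} - \ell _{\textbf {y}} < 0\) does not.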

There is, however, another possible extension: if the theory \(T_\mathfrak {S}\) is Presburger arithmetic, then the parametric automata could use the values \(\ell _{\textbf {x}}\). We first deal with a more generic, though restricted case, when this is not allowed: then all reductions from Sect. 4 generalize and we can reduce to word equations with regular and length constraints. However, the decidability status of this problem is unknown. When we consider Presburger arithmetic and allow the automata to employ the length variables, then it turns out that we can interpret the formula (1) as a collection of length constraints, and again we reduce to word equations with regular and length constraints.

Automata Oblivious of Lengths. We first consider the setting in which the length variables \(\mathcal L\) can only be used in length constraints. It is routine to verify that the reductions from Sect. 4 generalize to the case of length constraints: it is possible to first fix \(\mu \) for parameters, calling it again \(\pi \). Then Lemma 2 shows that each solution \(\mu \) can be mapped by a letter-to-letter homomorphism to a finite alphabet \(D_\pi \), and this mapping preserves the satisfiability/unsatisfiability of length constraints, so Lemma 2 still holds when length constraints are also allowed. Similarly, Lemma 3 is not affected by the length constraints. Finally, Lemma 4 deals with regular and equational constraints, ignoring the other possible constraints, and the lengths of the substitutions for variables are the same. Hence it also holds when length constraints are allowed, in which case the resulting word equations use regular and length constraints.

Unfortunately, the decidability of word equations with linear length constraints (even without regular constraints) is a notorious open problem. Thus instead of decidability, we get Turing-equivalent problems.

Theorem 5

Deciding regular, equational and length constraints for T-sequences of a decidable theory \(T\) is Turing-equivalent to word equations with regular and length constraints.

Automata Aware of the Sequence Lengths. We now consider the case when the underlying theory \(T_\mathfrak {S}\) is the Presburger arithmetic, i.e. \(\mathfrak {S}\) is the natural numbers and we can use addition, constants 0, 1 and comparisons (and variables). The additional functionality of the parametric automaton \(\mathcal {A}\) is that \(\varDelta {\subseteq _{\text {fin}}}Q\times T({\textit{curr}},\mathcal {X}, \mathcal L) \times Q\), i.e. the guards can also use the length variables; the semantics is extended in the natural way.

Then the type \({{\,\textrm{type}\,}}_{\pi }(a)\) of \(a \in \mathbb N\) now depends on the values of \(\mu \) on \(\mathcal {X}\) and \(\mathcal L\), hence we denote by \(\pi \) the restriction of \(\mu \) to \(\mathcal {X}\cup \mathcal L\). Then Lemmas 2 and 3 still hold when we fix \(\pi \). Similarly, Lemma 4 holds, but the analogue of (1) now uses also the length variables, which are also used in the length constraints. Such a formula can be seen as a collection of length constraints for original length variables \(\mathcal L\) as well as length variables \(\mathcal {X}\cup \{a_t \, : \, t \in \tau \}\). Hence we validate this formula as part of the word equations with length constraints. Note that \(a_t\) has two roles: as a letter in \(D_{\pi }\) and as a length variable. However, the connection is encoded in the formula from the reduction (analogue of (1)) and we can use two different sets of symbols.

Theorem 6

Deciding conjunction of regular, equational and length constraints for sequences of natural numbers with Presburger arithmetic, where the regular constraints can use length variables, is Turing-equivalent to word equations with regular and (up to exponentially many) length constraints.

Undecidability of Register Automata Constraints. One could use more powerful automata for regular constraints; one such popular model is register automata. Informally, such an automaton has k registers \(r_1, \ldots , r_k\), and its transition depends on the state and on the value of a formula using the registers and \({\textit{curr}}\), the read value [23]. The registers can be updated, to \({\textit{curr}}\) or to the value of one of the registers; this is specified in the transition. In “classic” register automata, guards can only use equality and inequality between registers and \({\textit{curr}}\); in the SRA model more powerful atoms are allowed. We show that sequence constraints and register automata constraints (which use quantifier-free formulas with equality and inequality as only atoms, i.e. do not employ the SRA extension) lead to undecidability (over an infinite domain D).
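To make the informal description of “classic” register automata concrete, the following Python sketch simulates a nondeterministic run. The transition encoding (a guard as a list of register/equality tests, and an optional register that stores \({\textit{curr}}\)) and the example automaton are assumptions chosen for illustration; copying one register into another is omitted for brevity.

```python
def ra_accepts(transitions, start, finals, k, word):
    """Simulate a register automaton with k registers on a word.

    A transition is (src, guard, store, dst): guard is a list of
    (register_index, must_equal) tests against curr, and store is the index of
    the register overwritten with curr, or None to leave registers unchanged.
    """
    configs = {(start, (None,) * k)}      # registers start out empty
    for curr in word:
        successors = set()
        for (q, regs) in configs:
            for (src, guard, store, dst) in transitions:
                if src != q:
                    continue
                if all((regs[r] == curr) == must_equal for r, must_equal in guard):
                    new_regs = list(regs)
                    if store is not None:
                        new_regs[store] = curr
                    successors.add((dst, tuple(new_regs)))
        configs = successors
    return any(q in finals for q, _ in configs)

# Hypothetical automaton: words of length >= 2 whose first and last letters agree.
FIRST_EQ_LAST = [
    (0, [], 0, 1),              # store the first letter in register 0
    (1, [], None, 1),           # skip any letter
    (1, [(0, True)], None, 2),  # guess that this letter is the last one
]
```

The stored value ranges over the whole infinite domain, which is what distinguishes registers from the parameters of parametric automata, whose values are fixed once for the entire run.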

Theorem 7

Satisfiability of equational constraints and register automata constraints, which use equality and inequality only, over infinite domain, is undecidable.

7 Implementations, Optimizations and Benchmarks

Implementation. We have implemented our decision procedure for problems in the constraint language SL for the theory of sequences in a new tool SeCo (Sequence Constraint Solver) on top of the SMT solver Princess [41]. We extend a publicly available library for symbolic automata and transducers [13] to parametric automata and transducers by connecting them to the uninterpreted constants in our theory of sequences. Our tool supports symbolic transducers, concatenation of sequences and reversing of sequences. Any additional function which satisfies RegInvRel such as a replace function which replaces only the first and leftmost longest match can be added in the future.

Our algorithm is an adaptation of the algorithm of the tool OSTRICH [12] and closely follows the proof of Theorem 4. To summarize the procedure, a depth-first search is employed to remove all functions in the given input, splitting on the pre-images of those functions. When removing a function, new assertions are added for the pre-image constraints. After all functions have been removed and only assertions are left, a nonemptiness check is called over all parametric automata which encode the assertions. If the check is successful, a corresponding model can be constructed; otherwise the procedure computes a conflict set and back-jumps to the last split in the depth-first search.

Benchmarks. We have performed experiments on two benchmark suites. The first one concerns the verification of properties of programs manipulating sequences. The second benchmark suite compares our tool against an algorithm using symbolic register automata [13] on decision problems for regular expressions with back-references, such as emptiness, equivalence and inclusion.

Both benchmark suites require universal quantification over the parameters. There are existing methods for eliminating these universal quantifiers; one such class is that of the semantically deterministic (SD) [22] PAs. Despite its name, being SD is algorithmically checkable. Most of the considered PAs are SD, in particular all in benchmark suite 2.

Experiments were conducted on an AMD Ryzen 5 1600 Six-Core CPU with 16 GB of RAM running on Windows 10. The results for the second benchmark suite are shown in Table 1. The timeout for all benchmarks is 300 s.

In the first benchmark suite we are looking to verify a weaker form of the permutation property of sorting as shown in Sect. 2. Furthermore, we verify properties of two self-stabilizing algorithms for mutual exclusion on parameterized systems. The first one is Lamport’s bakery algorithm [33], for which we proved that the algorithm ensures mutual exclusion. The system is modelled in the style of regular model checking [8], with system states represented as words, here over an infinite alphabet: the character representing a thread stores the thread control state, a Boolean flag, and an integer as the number drawn by the thread. The system transitions are modelled as parametric transducers, and invariants as parametric automata. The second algorithm is known as Dijkstra’s Self-Stabilizing Protocol [20], in which system states are encoded as sequences of integers, and in which we verify that the set of states in which exactly one processor is privileged forms an invariant. The mentioned benchmarks require universal quantification, but similar to the motivating example from Sect. 2 one can eliminate quantifiers by Skolemization and instantiation, which was done by hand.

The second benchmark suite consists of three different types of benchmarks, summarized in Table 1. The benchmark PR-Cn describes a regular expression for matching products which have the same code number of length n, and PR-CLn matches not only the code number but also the lot number. The last type of benchmark is IP-n, which matches n positions of 2 IP addresses. The benchmarks are taken from the regular-expression crowd-sourcing website RegExLib [39] and are also used in experiments for symbolic register automata [14] which we also compare our results against. To apply our decision procedure to the benchmarks, we encode each of the benchmarks as a parametric automaton, using parameters for the (bounded-size) back-references. The task in the experiments is to check emptiness, language equivalence, and language inclusion for the same combinations of the benchmarks as considered in [14].

Table 1. Benchmark suite 2. SRA is used for the algorithm for symbolic register automata and SEQ for our tool. The symbol \(\emptyset \) indicates the column where emptiness was checked, \(\equiv \) indicates self equivalence and \(\subseteq \) inclusion of languages.

Results of the Experiments. All properties can be encoded by parametric automata with very few states and parameters. As a result, the properties for each program can be verified in < 2.6 s. In detail, the property for Dijkstra’s algorithm was proven in 0.6 s, QuickSort in 1.1 s and Lamport’s bakery algorithm in 2.5 s.

The results for the second benchmark suite are shown in Table 1. The algorithm for symbolic register automata times out on 11 of the 36 benchmarks and our tool solves most benchmarks in <1 s. One thing to observe is that the algorithm for symbolic register automata scales poorly when more registers are needed to capture the back-references, while the performance of our approach does not change noticeably when more parameters are introduced.

8 Conclusion and Future Work

In this paper, we have performed a systematic investigation of decidability and complexity of constraints on sequences. Our starting point is the subcase of string constraints (i.e. over a finite set of sequence elements), which include equational constraints with concatenation, regular constraints, length constraints, and transducers. We have identified parametric automata (extending symbolic automata and variable automata) as a suitable notion of “regular constraints” over sequences, and parametric transducers (extending symbolic transducers) as a suitable notion of transducers over sequences. We showed that decidability results in the case of strings carry over to sequences, although the complexity is in general higher than in the case of strings (sometimes exponentially higher). For certain element theories (e.g. Linear Real Arithmetic), it is possible to retain the same complexity as in the string case. We also delineate the boundary of the suitable notion of “regular constraints” by showing that equational constraints with symbolic register automata [14] yield undecidable satisfiability. Finally, our new sequence solver SeCo shows promising experimental results.

There are several future research avenues. Firstly, the complexity of sequence constraints over other specific element theories (e.g. Linear Integer Arithmetic) should be precisely determined. Secondly, is it possible to recover decidability with other fragments of register automata (e.g., single-use automata [7])? On the implementation side, there is room for algorithmic improvements, e.g., better nonemptiness checks for parametric automata in the case of a single automaton, as well as for products of multiple automata.