Document Spanners: From Expressive Power to Decision Problems

Part of the following topical collections:
  1. Special Issue on Database Theory

Abstract

We examine document spanners, a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners that are defined by regex formulas, which are basically regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend the former by standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to three models – namely, patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.

Keywords

Information extraction · Document spanners · Regular expressions · Xregex · Patterns · Word equations · Decision problems · Descriptional complexity

1 Introduction

Information Extraction (IE) is the task of automatically extracting structured information from texts. This paper examines document spanners (also called spanners), a formalization of the IE query language AQL, which is used in IBM’s SystemT. Document spanners were introduced by Fagin et al. [7] in order to allow the theoretical examination of AQL, and were also used in [8].

A span is an interval on positions of a string w, and a spanner is a function that maps w to a relation over spans of w. A central topic of [7] and of the present paper are core spanners (according to Fagin et al., this name was chosen because core spanners capture the core of AQL).

The primitive building blocks of core spanners are regex formulas, which are regular expressions with variables. Each of these variables corresponds to a subexpression, and whenever a regex formula α matches a string w, each variable is mapped to the span in w that matches that subexpression. For example, consider the regex formula α := x{aaa} ⋅ a+y{a+}, with terminal a, and variables x and y. When α matches a string w, it maps x to the span that contains the first three positions of w, and y to a span from some position after the third to the last position of w. Hence, each match of α on w determines a tuple of spans; and as there can be multiple matches of a regex formula to a string, this process creates a relation over spans of w. Core spanners are then defined by extending regex formulas with the relational operations projection, union, natural join, and string equality selection.

One of the two main topics of the present paper is the examination of decision problems for core spanners, in particular evaluation and static analysis. These results are mostly derived from the other main topic, the examination of the expressive power of core spanners in relation to three other models that use repetition operators, which act similarly to the spanners’ string equality selection.

We begin by comparing core spanners to patterns. A pattern is a word that consists of variables and terminals, and generates the language of all words that can be obtained by substitution of the variables with arbitrary terminal words. For example, the pattern α = xxaby (where x and y are variables, and a and b are terminals) generates the language of all words that have a prefix that consists of a square, followed by the word ab. Although pattern languages have a simple definition, various decision problems for them are surprisingly hard. For example, their membership problem is NP-complete (cf. Angluin [1], Jiang et al. [24]), and their inclusion problem is undecidable (cf. Bremer and Freydenberger [4]). As we show that core spanners can recognize pattern languages, this allows us to conclude that evaluation of Boolean core spanners is NP-hard, and that spanner containment is undecidable.

Next, we consider word equations, which are equations of the form α = β, where α and β are patterns. Word equations can be used to define languages and word relations. We show that word equations with regular constraints can express all relations that are expressible with core spanners. By using an improved version of Makanin’s algorithm (cf. Diekert [6]), this allows us to show that satisfiability and hierarchicality for core spanners can be decided in PSPACE. Moreover, using coding techniques from word equations, we show that two common relations from combinatorics on words can be selected with core spanners.

Finally, we examine the relation of core spanners to xregexes (also called extended regular expressions, regexes, or regular expressions with backreferences in the literature). These are regular expressions that can use a repetition operator, which is available in most modern implementations of regular expressions (see, e. g., Friedl [17]) and which allows the definition of non-regular languages. For example, the xregex x{Σ*} ⋅ &x ⋅ &x generates all cubic words over Σ, as x{Σ*} generates some word w which is stored in the variable x, and each occurrence of &x repeats that w. As a consequence of this increase in expressive power, many decision problems are harder for xregexes than for their “classical” counterparts. In particular, various problems of static analysis are undecidable (Freydenberger [12]).

But as shown by Fagin et al. [7], core spanners cannot define all languages that are definable by xregexes. Intuitively, the reason for this is that xregexes can use their repetition operators inside a Kleene star, which allows them to repeat an arbitrary word an unbounded number of times; for example, the xregex x{Σ*} ⋅ &x+ generates the language of all w^n with n ≥ 2. In contrast to this, core spanners have to express repetitions with variables and string equality selections. Inspired by this observation, we introduce variable-star free (or vstar-free) xregexes as those xregexes that neither define nor use variables inside a Kleene star. We show that every vstar-free xregex can be converted into an equivalent core spanner. Since all undecidability results by Freydenberger [12] also apply to vstar-free xregexes, these undecidability results carry over to core spanners. This also has various consequences for the minimization and the relative succinctness of classes of spanner representations. We also show that complementing a core spanner can lead to a size increase that is not bounded by any recursive function (for basically all natural notions of size). Although this does not solve an open problem by Fagin et al. [7] on the simplification of core spanners with difference operators, it shows that if simplification is possible, it has to be non-computable. As a further contribution, we also develop tools to prove inexpressibility for vstar-free regular expressions and for core spanners.

As we shall see, many of the observed lower bounds hold even for comparatively restricted classes of core spanners (in particular, most of the results hold for spanners that do not use join). Hence, the authors consider it reasonable to expect that these results can be easily adapted to other information extraction languages that combine regular expressions with capture variables and a string equality operator.

In addition to regex formulas, Fagin et al. [7] also consider two types of automata as basic building blocks of spanner representations. While the present paper does not discuss these in detail, most of the results on spanner representations that are based on regex formulas can be directly converted to the respective class of spanner representations that are based on automata.

Related Work

For an overview of related models, we refer to Fagin et al. [7]. In addition to this, we highlight connections to models with similar properties. In [7], Fagin et al. showed that there is a language that can be defined by xregexes, but not by core spanners. Furthermore, they compared the expressive power of core spanners and a variant of conjunctive regular path queries (CRPQs), a graph querying language. Barceló et al. [2] introduced extended CRPQs (ECRPQs), which can compare paths in the graph with regular relations. While there is no direct connection between ECRPQs and core spanners, both models share the basic idea of combining regular languages with a comparison operator that can express string equality. As shown by Freydenberger and Schweikardt [16], ECRPQs have undecidability results that are comparable to those in the present paper, and to those for xregexes (cf. Freydenberger [12]). Furthermore, Barceló and Muñoz [3] have used word equations with regular constraints for variants of CRPQs.

Also note that Freydenberger [13] extends the results on the connection between word equations and core spanners from the present paper into a logic on words that has the same expressive power as core spanners.

Structure of the Paper

In Section 2, we give definitions of xregexes and of core spanners. Section 3 compares the expressive power of core spanners to patterns, word equations, and vstar-free regular expressions. The results from this section are then used in Section 4 to examine the complexity of evaluation and static analysis of spanners. We also examine the consequences of these results to the relative succinctness of different spanner representations. Section 5 concludes the paper.

2 Preliminaries

Let \(\mathbb {N}\) and \(\mathbb {N}_{>0}\) be the sets of non-negative and positive integers, respectively. Let Σ be a fixed finite alphabet of (terminal) symbols. Except when stated otherwise, we assume |Σ| ≥ 2. We use ε to denote the empty word. For every word w ∈ Σ* and every a ∈ Σ, let |w| denote the length of w, and |w|_a the number of occurrences of a in w. A word x ∈ Σ* is a subword of a word y ∈ Σ* if there exist u, v ∈ Σ* with y = uxv. A word x ∈ Σ* is a prefix of a word y ∈ Σ* if there exists a v ∈ Σ* with y = xv, and a proper prefix if it is a prefix and x ≠ y. For every \(n\in \mathbb {N}\), an n-ary word relation (over Σ) is a subset of (Σ*)^n.

2.1 Regexes (Extended Regular Expressions)

This section introduces the syntax and semantics of xregexes, which we shall also use for regex formulas in Section 2.2. We begin with the syntax, which follows the definition from [7].

Definition 2.1

We fix an infinite set X of variables and define the set M of meta symbols as M := {ε, ∅, (, ), {, }, ⋅, ∨, *, &}. Let Σ, X, and M be pairwise disjoint. The set of xregexes (extended regular expressions) is defined as follows:
  1. The symbols ∅ and ε, and every a ∈ Σ, are xregexes.
  2. If α1 and α2 are xregexes, then (α1 ⋅ α2) (concatenation), (α1 ∨ α2) (disjunction), and \((\alpha _{1}^{*})\) (Kleene star) are xregexes.
  3. For every x ∈ X and every xregex α that contains neither x{⋯} nor &x as a subword, x{α} is an xregex (variable binding).
  4. For every x ∈ X, we have that &x is an xregex (variable reference).

     
If a subword β of an xregex α is an xregex itself, we call β a subexpression (of α). The set of all subexpressions of α is denoted by Sub(α), and the set of variables occurring in variable bindings in an xregex α is denoted by Vars(α). If an xregex α contains neither variable references, nor variable bindings, we call α a proper regular expression.

In other words, we use the term “proper” to distinguish those expressions that are usually just called “regular expressions” from the more general extended regular expressions. We use the notation α+ as a shorthand for α ⋅ α*. Parentheses can be added freely. We may also omit parentheses and the concatenation operator, where we assume that * and + take precedence over concatenation, and concatenation precedes disjunction. Furthermore, we use Σ as a shorthand for the regular expression \(\bigvee _{a\in \Sigma } a\).

Before introducing the semantics of xregexes formally, we give an intuitive explanation. An expression of the form α = x{β} matches the same strings as β, but α additionally stores the matched string in the variable x. Using a variable reference &x, this string can then be repeated. For example, let α := (x{Σ*} ⋅ &x). The subexpression x{Σ*} matches any string w ∈ Σ* and stores this match in x. The following variable reference &x repeats the stored w. Thus, α defines the (non-regular) copy-language {ww ∣ w ∈ Σ*}.

The following definition of the semantics of xregexes is based on the semantics by Freydenberger [12], which is an adaption of the semantics from Câmpeanu et al. [5] (the former uses variables, the latter backreferences). In comparison to [12], the case for Kleene star has been changed, in order to make the definition compatible with the parse trees for regex formulas from Fagin et al. [7].

Definition 2.2

Let γ be an xregex over Σ and X. A γ-parse tree is a finite, directed, and ordered tree Tγ. Its nodes are labeled with tuples of the form (w, γ′) ∈ (Σ* × Sub(γ)). The root of every γ-parse tree Tγ is labeled (w, γ) with w ∈ Σ*; and the following rules must hold for each node v of Tγ:
  1) If v is labeled (w, a) with a ∈ (Σ ∪ {ε}), then v is a leaf, and w = a.
  2) If v is labeled (w, (β1 ⋅ β2)), then v has exactly one left child v1 and exactly one right child v2 with respective labels (w1, β1) and (w2, β2), and w = w1w2.
  3) If v is labeled (w, (β1 ∨ β2)), then v has a single child, labeled (w, β1) or (w, β2).
  4) If v is labeled (w, β*), then one of the following cases holds: (a) w = ε, and v is a leaf, or (b) w = w1w2 ⋯ wk for words w1, …, wk ∈ Σ+ (with k ≥ 1), and v has k children v1, …, vk (ordered from left to right) that are labeled (w1, β), …, (wk, β).
  5) If v is labeled (w, x{β}), then v has a single child, labeled (w, β).
  6) If v is labeled (w, &x), let ≺ denote the post-order of the nodes of Tγ (that results from a left-to-right, depth-first traversal). Then one of the following cases applies: (a) If there is no node v′ with v′ ≺ v that is labeled (w′, x{β′}) ∈ Σ* × Sub(γ), then v is a leaf, and w = ε. (b) Otherwise, let v′ be the node with v′ ≺ v that is ≺-maximal among nodes labeled (w′, x{β′}). Then v is a leaf, and w = w′.

     
If the root of a γ-parse tree Tγ is labeled (w, γ), we call Tγ a γ-parse tree for w. If the context is clear, we omit γ and call Tγ a parse tree.
There is no parse tree for ∅, and references to unbound variables (i. e., variables that were not assigned a value with a variable binding operator) default to ε. For an example of a parse tree, see Fig. 1.
Fig. 1

An α-parse tree for w, where α := &x ⋅ (x{(ab)}⋅&x) and w := abab. For these choices of α and w, this is the only possible parse tree

We use parse trees to define the semantics of xregexes:

Definition 2.3

An xregex γ recognizes the language \(\mathcal {L}(\gamma )\) of all w ∈ Σ* for which there exists a γ-parse tree Tγ with (w, γ) as root label.

Example 2.4

Consider the xregexes α := x{Σ+} ⋅ (&x)+, β := x{Σ+} ⋅ &x ⋅ x{Σ+} ⋅ &x, and γ := x{aa+} ⋅ (&x)+ for some a ∈ Σ.

Then \(\mathcal {L}(\alpha )=\{w^{n}\mid w\in \Sigma ^{+}, n\geq 2\}\), \(\mathcal {L}(\beta )=\{x_{1}x_{1}x_{2}x_{2}\mid x_{1},x_{2}\in \Sigma ^{+}\}\) , and \(\mathcal {L}(\gamma )=\{a^{n}\mid n\geq 2, \text {\textit {n} is not prime}\}.\)
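The languages of α and γ can be tried out with the backreferences of a practical regex engine. The following is a minimal sketch using Python's `re` module over the assumed alphabet Σ = {a, b}; the helper names `in_alpha`, `in_gamma`, and `is_prime` are hypothetical. Python's backreference semantics are only an approximation of Definition 2.2, and details such as the treatment of unbound references may differ.

```python
import re

# alpha = x{Sigma+} . (&x)+ : a nonempty word w, followed by at least one copy of w.
ALPHA = re.compile(r"([ab]+)\1+")
# gamma = x{a a+} . (&x)+ : a block of k >= 2 letters a, followed by at least one copy.
GAMMA = re.compile(r"(aa+)\1+")


def in_alpha(word: str) -> bool:
    """True if the whole word has the form w^n with w nonempty and n >= 2."""
    return ALPHA.fullmatch(word) is not None


def in_gamma(word: str) -> bool:
    """True if the whole word is a^n for a composite number n."""
    return GAMMA.fullmatch(word) is not None


def is_prime(n: int) -> bool:
    """Trial division, used only to cross-check in_gamma."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))
```

For instance, `in_alpha("abab")` holds while `in_alpha("aba")` does not, and `in_gamma("a" * n)` holds exactly for those n ≥ 2 that are not prime.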

2.2 Document Spanners

Let w := a1a2 ⋯ an be a word over Σ, with \(n\in \mathbb {N}\) and a1, …, an ∈ Σ. A span of w is an interval [i, j〉 with 1 ≤ i ≤ j ≤ n + 1 and \(i,j \in \mathbb {N}\). For each span [i, j〉 of w, we define a subword w[i,j〉 := a_i ⋯ a_{j−1}. In other words, each span describes a subword of w by its bounding indices. Two spans [i, j〉 and [i′, j′〉 of w are equal if and only if i = i′ and j = j′. These spans overlap if i ≤ i′ < j or i′ ≤ i < j′, and are disjoint otherwise. The span [i, j〉 contains the span [i′, j′〉 if i ≤ i′ ≤ j′ ≤ j. The set of all spans of w is denoted by Spans(w).

Example 2.5

Let w := aabbcabaa. As |w| = 9, both [1, 3〉 and [8, 10〉 are spans of w, but [10, 11〉 is not. Although w[1,3〉 = w[8,10〉 = aa, the first two spans are not equal. Likewise, the two spans [3, 3〉 and [5, 5〉 are not equal, even though w[3,3〉 = w[5,5〉 = ε. The whole word w is described by the span [1, 10〉.
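The claims of this example can be checked mechanically. The sketch below represents a span [i, j〉 as a Python pair `(i, j)` with 1-based, end-exclusive indices; the function names `spans` and `subword` are illustrative and not taken from the paper.

```python
from itertools import combinations_with_replacement


def spans(w: str):
    """All spans [i, j> of w, i.e. all pairs with 1 <= i <= j <= len(w) + 1."""
    return list(combinations_with_replacement(range(1, len(w) + 2), 2))


def subword(w: str, span):
    """The subword a_i ... a_{j-1} described by the span [i, j>."""
    i, j = span
    return w[i - 1 : j - 1]


w = "aabbcabaa"
all_spans = spans(w)
```

Here `(1, 3)` and `(8, 10)` are different spans that describe the same subword `"aa"`, and `(10, 11)` is not contained in `all_spans`.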

Definition 2.6

Let SVars be a fixed, infinite set of span variables, where Σ and SVars are disjoint. Let V ⊂ SVars be a finite subset of SVars, and let w ∈ Σ*. A (V, w)-tuple is a function μ: V → Spans(w) that maps each variable in V to a span of w. If context allows, we write w-tuple instead of (V, w)-tuple. A set of (V, w)-tuples is called a (V, w)-relation.

As V and Spans(w) are finite, every (V, w)-relation is finite by definition. Our next step is the definition of document spanners, which map words w to (V, w)-relations:

Definition 2.7

Let V and Σ be alphabets of variables and symbols, respectively. A (document) spanner is a function P that maps every word w ∈ Σ* to a (V, w)-relation P(w). Let V be denoted by SVars(P). A spanner P is n-ary if |SVars(P)| = n, and Boolean if SVars(P) = ∅. For all w ∈ Σ*, we say P(w) = True and P(w) = False instead of P(w) = {()} and P(w) = ∅, respectively.

A w-tuple μ ∈ P(w) is hierarchical if for all x, y ∈ SVars(P) at least one of the following holds: (1) The span μ(x) contains μ(y), (2) the span μ(y) contains μ(x), or (3) the spans μ(x) and μ(y) are disjoint. A spanner P is hierarchical if, for every w ∈ Σ*, every μ ∈ P(w) is hierarchical.
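The hierarchy condition is a pairwise test on spans, which can be sketched as follows. A w-tuple is modelled as a dictionary from variable names to pairs `(i, j)` standing for [i, j〉; this encoding and the function names are assumptions made for illustration.

```python
def contains(s, t) -> bool:
    """Span s = [i, j> contains span t = [i2, j2> if i <= i2 <= j2 <= j."""
    (i, j), (i2, j2) = s, t
    return i <= i2 <= j2 <= j


def overlap(s, t) -> bool:
    """Spans overlap if i <= i2 < j or i2 <= i < j2; otherwise they are disjoint."""
    (i, j), (i2, j2) = s, t
    return i <= i2 < j or i2 <= i < j2


def is_hierarchical(mu: dict) -> bool:
    """Check conditions (1) to (3) for every pair of variables of the tuple mu."""
    values = list(mu.values())
    return all(
        contains(s, t) or contains(t, s) or not overlap(s, t)
        for s in values
        for t in values
    )
```

A tuple with x ↦ [1, 4〉 and y ↦ [2, 3〉 is hierarchical, whereas x ↦ [1, 3〉 and y ↦ [2, 4〉 is not, since these spans overlap without one containing the other.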

A spanner P is total on w if P(w) contains all w-tuples over SVars(P). Let Y ⊂ SVars be a finite set of variables. The universal spanner over Y is denoted by ΥY. It is the unique spanner P′ such that SVars(P′) = Y and P′ is total on every w ∈ Σ*. Furthermore, a spanner P is hierarchical total on w if P(w) is exactly the set of all hierarchical w-tuples over SVars(P); and the universal hierarchical spanner over a set Y is the unique spanner \({\Upsilon }^{\mathbf {H}}_{Y}\) that is hierarchical total on every w ∈ Σ*.

For two spanners P1 and P2, we write P1 ⊆ P2 if P1(w) ⊆ P2(w) for every w ∈ Σ*, and P1 = P2 if P1(w) = P2(w) for every w ∈ Σ*.

Hence, a spanner can be understood as a function that maps a word w to a set of functions, each of which assigns spans of w to the variables of the spanner. As Boolean spanners are functions that map words to truth values, they can be interpreted as characteristic functions of languages. For every Boolean spanner P, we define the language recognized by P as \(\mathcal {L}(P):=\{w\in \Sigma ^{*}\mid P(w)=\texttt {True}\}\). We extend this to arbitrary spanners P by \(\mathcal {L}(P):=\{w\in \Sigma ^{*}\mid P(w)\neq \emptyset \}\).

Definition 2.8

A regex formula is an xregex α over Σ and X := SVars such that α does not contain any variable references, and for every β ∈ Sub(α) with β = γ*, no subexpression of γ may be a variable binding.

In other words, a regex formula is a proper regular expression that is extended with variable binding operators, but these operators may not occur inside a Kleene star. We define SVars(γ) := Vars(γ) for all regex formulas γ.

To define the semantics of regex formulas, we use the definition of parse trees for xregexes, see Definition 2.2. Intuitively, the goal of this definition is that each occurrence of a variable x in a γ-parse tree is matched to the corresponding span. Here, two problems can arise. Firstly, a variable might not occur in the parse tree; for example, when matching the regex formula (x{a} ∨ bb) to the word bb. Secondly, a variable might be defined too often, as e. g. in the regex formula x{Σ+} ⋅ x{Σ+}. In order to avoid such problems, we introduce the notion of a functional regex formula.

Definition 2.9

Let γ be a regex formula. We call γ functional if, for every w ∈ Σ*, every γ-parse tree Tγ for w, and each variable x ∈ SVars(γ), exactly one node of Tγ has a label of the form (v, x{β}), where v is a subword of w and β is a sub-regex formula of γ. The class of all functional regex formulas is denoted by RGX.

As shown in Proposition 3.5 in Fagin et al. [7], functionality has a straightforward syntactic characterization: Basically, variables may not be redeclared, variables may not be used inside of Kleene stars, and if variables are used in a disjunction, each side of a disjunction has to bind exactly the same variables. Consider the following example:

Example 2.10

The regex formula γ1 := (x{a} ∨ x{b}) is functional even though it contains two occurrences of variable definitions for x. There are just two γ1-parse trees, both of which only contain one node labeled (c, x{c}), where c ∈ {a, b}. As a trivial case, even γ2 := x{∅} is functional (as no γ2-parse tree exists). Furthermore, the regex formulas γ3 := x{(ab)} ⋅ x{b+} and γ4 := a ∨ x{b} are not functional. Finally, γ5 := x{a}* is not a regex formula at all.

For functional regex formulas, we use parse trees to define the semantics:

Definition 2.11

Let γ be a functional regex formula and let T be a γ-parse tree for a word w ∈ Σ*. For every node v of T, the subtree that is rooted at v naturally maps to a span p(v) of w. As γ is functional, for every x ∈ SVars(γ), exactly one node vx of T has a label that contains x. We define μT: SVars(γ) → Spans(w) by μT(x) := p(vx). Each γ ∈ RGX defines a spanner [[γ]] by
$${\left[{\kern-2.3pt}[ \gamma \right]{\kern-2.3pt}]}(w):= \{\mu^{T}\mid \text{\textit{T} is a \(\gamma\)-parse tree for \textit{w}}\}$$
for each w ∈ Σ*.

Example 2.12

Assume that a, b ∈ Σ. We define the regex formula
$$\alpha:= \Sigma^{*} \cdot x\{\mathtt{a}\cdot y\{\Sigma^{*}\} \cdot (z\{\mathtt{a}\}\mathbin{\vee} z\{\mathtt{b}\})\}\cdot\Sigma^{*}.$$
Let w := baaba. Then [[α]](w) consists of ([2, 4〉, [3, 3〉, [3, 4〉), ([2, 5〉, [3, 4〉, [4, 5〉), ([2, 6〉, [3, 5〉, [5, 6〉), ([3, 5〉, [4, 4〉, [4, 5〉), and ([3, 6〉, [4, 5〉, [5, 6〉).
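The five tuples can be recomputed by brute force. The sketch below is hand-specialized to this particular α (it is not a general regex formula evaluator): x must start at a position holding a, z covers exactly one later position holding a or b, and y is the span between them. The function name `eval_alpha` is hypothetical.

```python
def eval_alpha(w: str):
    """Brute-force [[alpha]](w) for alpha = Sigma* x{ a y{Sigma*} (z{a} v z{b}) } Sigma*.

    Spans are returned as 1-based pairs (i, j) standing for [i, j>.
    """
    n = len(w)
    result = []
    for i in range(1, n + 1):          # x starts at position i, which must hold 'a'
        if w[i - 1] != "a":
            continue
        for k in range(i + 1, n + 1):  # z covers exactly position k ('a' or 'b')
            if w[k - 1] in "ab":
                result.append({"x": (i, k + 1), "y": (i + 1, k), "z": (k, k + 1)})
    return result


tuples = [(m["x"], m["y"], m["z"]) for m in eval_alpha("baaba")]
```

For w = baaba, `tuples` lists the (x, y, z) triples in the same order as in the example.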

For every w ∈ Σ*, a spanner P defines a (V, w)-relation P(w). In order to construct more sophisticated spanners, we introduce spanner operators.

Definition 2.13

Let P, P1, P2 be spanners and let w ∈ Σ*. The algebraic operators union, projection, natural join and selection are defined as follows.
Union:

Two spanners P1 and P2 are union compatible if SVars(P1) = SVars(P2), and their union (P1 ∪ P2) is defined by SVars(P1 ∪ P2) := SVars(P1) = SVars(P2) and (P1 ∪ P2)(w) := P1(w) ∪ P2(w) for every w ∈ Σ*.

Projection:

Let Y ⊆ SVars(P). The projection πY P is defined by SVars(πY P) := Y and πY P(w) := P(w)|Y for all w ∈ Σ*, where P(w)|Y is the restriction of all w-tuples in P(w) to Y.

Natural join:

Let Vi := SVars(Pi) for i ∈ {1, 2}. The (natural) join (P1 ⋈ P2) of P1 and P2 is defined by SVars(P1 ⋈ P2) := SVars(P1) ∪ SVars(P2) and, for all w ∈ Σ*, we define (P1 ⋈ P2)(w) as the set of all (V1 ∪ V2, w)-tuples μ for which there exist (Vi, w)-tuples μi ∈ Pi(w), i ∈ {1, 2}, with \({\mu }|_{V_{1}} = \mu _{1}\) and \({\mu }|_{V_{2}} = \mu _{2}\).

Selection:

Let R ⊆ (Σ*)^k be a k-ary relation over Σ. The selection operator ζR is parameterized by k variables x1, …, xk ∈ SVars(P), written as \(\zeta ^{R}_{x_{1},\dots ,x_{k}}\). The selection \(\zeta ^{R}_{x_{1},\dots ,x_{k}} P\) is defined by \(\textsf{SVars}(\zeta ^{R}_{x_{1},\dots ,x_{k}} P) := {\textsf{SVars}\left (P\right )}\) and, for all w ∈ Σ*, we define \(\zeta ^{R}_{x_{1},\dots ,x_{k}} P(w)\) as the set of all μ ∈ P(w) for which \(\left (w_{\mu (x_{1})}, \dots , w_{\mu (x_{k})}\right ) \in R\).

Like [7], we mostly consider the string equality selection operator ζ=. Hence, unless otherwise noted, the term “selection” refers to selection by the n-ary string equality relation. Note that unlike selection (which compares strings), join requires that the spans are identical.

The join P1 ⋈ P2 of two spanners P1 and P2 is equivalent to the intersection P1 ∩ P2 if SVars(P1) = SVars(P2), and to the Cartesian product P1 × P2 if SVars(P1) and SVars(P2) are disjoint. Hence, if applicable, we write ∩ and × instead of ⋈.

For convenience, we may add and omit parentheses. We assume there is an order of precedence with projection and selection ranking over join ranking over union, e.g. we may write \(\pi _{Y} \zeta ^=_{x,y} P_{1} \cup P_{2} \bowtie P_{3}\) instead of \((\pi _{Y} \zeta ^=_{x,y} P_{1} \cup (P_{2} \bowtie P_{3}))\), where projection and selection are applied to P1, and the result is united with the join of P2 and P3.

Example 2.14

Let \(P_{1}:= \zeta ^=_{x,y} {\left [{\kern -2.3pt}[ x\{\Sigma ^{*}\} y\{\Sigma ^{*}\} \right ]{\kern -2.3pt}]}\) and \(P_{2}:= \zeta ^=_{x,y,z}{\left [{\kern -2.3pt}[x\{\Sigma ^{*}\} y\{\Sigma ^{*}\} z\{\Sigma ^{*}\} \right ]{\kern -2.3pt}]}\). Then \(\mathcal {L}(P_{1})=\{ww\mid w\in \Sigma ^{*}\}\), and the variables x and y refer to the span of the first and second occurrence of w, respectively. Analogously, \(\mathcal {L}(P_{2})=\{w^{3}\mid w\in \Sigma ^{*}\}\) (and z refers to the third occurrence of w). Assume that we want to construct a spanner for the language {w^n ∣ w ∈ Σ*, n ∈ {2, 3}}. As P1 and P2 are not union compatible, we cannot simply define P1 ∪ P2. Union compatibility can be achieved by projecting P2 onto the set of common variables (i. e., π{x,y} P2); the spanner P1 ∪ π{x,y} P2 then recognizes the desired language.
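The operators of Definition 2.13 and the spanners of this example can be prototyped on finite relations. The following is a sketch under several assumptions: a w-tuple is encoded as a `frozenset` of `(variable, span)` pairs, spans are 1-based pairs `(i, j)`, and the regex formulas x{Σ*}y{Σ*} and x{Σ*}y{Σ*}z{Σ*} are evaluated by enumerating all ways of cutting w into consecutive pieces. All function names are hypothetical, and the code makes no attempt at efficiency.

```python
from itertools import combinations_with_replacement


def freeze(mu: dict):
    """Encode a w-tuple as a hashable set of (variable, span) pairs."""
    return frozenset(mu.items())


def union(r1, r2):
    """Union of two relations; callers must ensure union compatibility."""
    return r1 | r2


def project(r, keep):
    """Restrict every tuple of r to the variables in keep."""
    return {frozenset((v, s) for v, s in mu if v in keep) for mu in r}


def join(r1, r2):
    """Natural join: tuples must agree on the spans (not just the subwords) of shared variables."""
    out = set()
    for m1 in r1:
        for m2 in r2:
            d1, d2 = dict(m1), dict(m2)
            if all(d1[v] == d2[v] for v in d1.keys() & d2.keys()):
                out.add(freeze({**d1, **d2}))
    return out


def select_eq(r, w, variables):
    """String equality selection: keep tuples whose listed variables describe equal subwords."""
    def sub(span):
        return w[span[0] - 1 : span[1] - 1]
    return {mu for mu in r if len({sub(dict(mu)[v]) for v in variables}) == 1}


def split_formula(w, variables):
    """[[x1{Sigma*} ... xk{Sigma*}]](w): all ways to cut w into k consecutive spans."""
    n, k = len(w), len(variables)
    out = set()
    for cuts in combinations_with_replacement(range(1, n + 2), k - 1):
        bounds = (1, *cuts, n + 1)
        out.add(freeze({v: (bounds[i], bounds[i + 1]) for i, v in enumerate(variables)}))
    return out


def p1(w):
    return select_eq(split_formula(w, ["x", "y"]), w, ["x", "y"])


def p2(w):
    return select_eq(split_formula(w, ["x", "y", "z"]), w, ["x", "y", "z"])


def squares_or_cubes(w):
    """P1 united with the projection of P2 onto {x, y}."""
    return union(p1(w), project(p2(w), {"x", "y"}))
```

For example, `squares_or_cubes("abab")` and `squares_or_cubes("aaa")` are nonempty, while `squares_or_cubes("aba")` is empty.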

Definition 2.15

A spanner algebra is a finite set of spanner operators. If O is a spanner algebra, then RGX^O denotes the set of all spanner representations that can be constructed by (repeated) combination of the symbols for the operators from O with regex formulas from RGX. For each operator o ∈ O and each spanner representation of the form oρ (if o is unary) or ρ1 o ρ2 (if o is binary), we define [[oρ]] := o[[ρ]] or [[ρ1 o ρ2]] := [[ρ1]] o [[ρ2]], respectively. Furthermore, [[RGX^O]] is the closure of [[RGX]] under the spanner operators in O.

We define \(\mathcal {L}(\rho ):=\mathcal {L}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\) for every spanner representation ρ. Fagin et al. [7] refer to [[RGX]] as the class of hierarchical regular spanners and to [[RGX{π, ∪, ⋈}]] as the class of regular spanners. In addition to (hierarchical) regular spanners, Fagin et al. also introduced the so-called core spanners, which are obtained by combining regex formulas with the four algebraic operators projection, selection, union, and join – in other words, the class of core spanners is the class \([{\kern -2.3pt}[\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}]{\kern -2.3pt}]\). Analogously, \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) is the class of core spanner representations.

3 Expressibility Results

3.1 Pattern Languages

We begin our examination of the expressive power of core spanners by comparing them to one of the simplest mechanisms with repetition operators:

Definition 3.1

Let X be an infinite variable alphabet that is disjoint from Σ. A pattern is a word α ∈ (Σ ∪ X)+ that generates the language
$$\mathcal{L}(\alpha):=\{\sigma(\alpha)\mid \text{\(\sigma\) is a pattern substitution}\}, $$
where a pattern substitution is a homomorphism σ: (Σ ∪ X)* → Σ* with σ(a) = a for all a ∈ Σ. We denote the set of all variables in α by Vars(α).

Intuitively, a pattern α generates exactly those words that can be obtained by replacing the variables in α with terminal words homomorphically (i. e., multiple occurrences of the same variable have to be replaced in the same way). This type of pattern languages is also called erasing pattern language (cf. Jiang et al. [24]).

Example 3.2

Let x, y ∈ X and a, b ∈ Σ. The patterns α := xx and β := xaybx generate the languages \(\mathcal {L}(\alpha )=\{ww\mid w\in \Sigma ^{*}\}\) and \(\mathcal {L}(\beta )=\{v \mathtt {a} w \mathtt {b} v\mid v,w\in \Sigma ^{*}\}.\)
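Membership in an (erasing) pattern language can be decided by a simple backtracking search over candidate images of the variables. The sketch below is one naive way to do this and is not taken from the paper; the function name and the convention of passing the set of variable symbols explicitly are assumptions. In the worst case the search takes exponential time.

```python
def matches_pattern(pattern: str, word: str, variables: set) -> bool:
    """Decide whether word lies in the erasing pattern language of pattern.

    Symbols of pattern that occur in `variables` are variables; all other
    symbols are terminals. Variables may be replaced by the empty word.
    """
    def go(p: int, pos: int, sigma: dict) -> bool:
        if p == len(pattern):
            return pos == len(word)
        sym = pattern[p]
        if sym not in variables:                      # terminal: must match literally
            return word.startswith(sym, pos) and go(p + 1, pos + len(sym), sigma)
        if sym in sigma:                              # bound variable: repeat its image
            image = sigma[sym]
            return word.startswith(image, pos) and go(p + 1, pos + len(image), sigma)
        for end in range(pos, len(word) + 1):         # unbound variable: try every image
            sigma[sym] = word[pos:end]
            if go(p + 1, end, sigma):
                return True
        del sigma[sym]                                # undo the binding before backtracking
        return False

    return go(0, 0, {})
```

With the patterns of Example 3.2, `matches_pattern("xx", "abab", {"x"})` and `matches_pattern("xaybx", "baabb", {"x", "y"})` hold, while `matches_pattern("xx", "aba", {"x"})` does not.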

From every pattern α, we can straightforwardly construct an xregex for \(\mathcal {L}(\alpha )\). A similar observation holds for core spanners:

Theorem 3.3

There is an algorithm that, given a pattern α, computes in polynomial time \(\rho _{\alpha }\in \textsf{RGX}^{\{\zeta ^=\}}\) such that \(\mathcal {L}(\rho _{\alpha })=\mathcal {L}(\alpha )\).

Proof

Let α = α1 ⋯ αn with \(n\in \mathbb {N}_{>0}\) and α1, …, αn ∈ (Σ ∪ X). We rewrite α into a regex formula \(\hat {\alpha }\), by replacing the i-th occurrence of a variable x with a binding xi{Σ*}. More formally, we define \(\hat {\alpha }:= \hat {\alpha }_{1}{\cdots } \hat {\alpha }_{n}\), where for each i ∈ {1, …, n}, the regex formula \(\hat {\alpha }_{i}\) is defined as follows:
  1. If αi is a terminal (i. e., there is an a ∈ Σ with αi = a), let \(\hat {\alpha }_{i}:= a\).
  2. If αi is the j-th occurrence of a variable x ∈ X in α, let \(\hat {\alpha }_{i}:= x_{j}\{\Sigma ^{*}\}\).

     
Hence, no variable occurs twice in \(\hat {\alpha }\); and as \(\hat {\alpha }\) contains no disjunctions on variables, \(\hat {\alpha }\) is functional.

We now define S to be a sequence of selections, where S contains exactly the selections \(\zeta ^=_{x_{1},\ldots ,x_{k}}\) for each x ∈ Vars(α) with |α|_x = k and k ≥ 2. In other words, for each x that occurs more than once in α, we include a selection of all xi.

Finally, we define \(\rho _{\alpha }:= S \hat {\alpha }\). It is easy to see that \(\mathcal {L}(\rho _{\alpha })=\mathcal {L}(\alpha )\): For every \(w\in \mathcal {L}(\alpha )\), we can use a pattern substitution σ with σ(α) = w to construct a corresponding w-tuple μ for ρα. Likewise, for every \(w\in \mathcal {L}(\rho _{\alpha })\), there exists a corresponding w-tuple μ from which we can reconstruct a pattern substitution σ with σ(α) = w: By the construction of ρα, for each pair of variables xi, xj in \(\hat {\alpha }\), the words \(w_{\mu (x_{i})}\) and \(w_{\mu (x_{j})}\) must be identical. This allows us to define \(\sigma (x):= w_{\mu (x_{1})}\). □

Example 3.4

Let x, y, z ∈ X, a, b ∈ Σ, and define the pattern α := xayybxzx. The construction in the proof of Theorem 3.3 leads to the spanner representation \(\zeta ^=_{x_{1},x_{2},x_{3}}\zeta ^=_{y_{1},y_{2}} \gamma \), where γ = x1{Σ*} ⋅ a ⋅ y1{Σ*} ⋅ y2{Σ*} ⋅ b ⋅ x2{Σ*} ⋅ z1{Σ*} ⋅ x3{Σ*}.
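The rewriting in the proof of Theorem 3.3 is a single left-to-right pass over the pattern. The sketch below returns the result as plain strings, which is only an illustrative encoding: the function name `pattern_to_spanner` and the output format are assumptions, not part of the paper.

```python
from collections import Counter


def pattern_to_spanner(pattern: str, variables: set):
    """Return (selections, factors) for the representation S applied to alpha-hat.

    factors    -- the concatenated parts of the regex formula alpha-hat
    selections -- one list of indexed variables per string equality selection
    """
    seen = Counter()
    factors = []
    for sym in pattern:
        if sym in variables:
            seen[sym] += 1                      # j-th occurrence of the variable
            factors.append(f"{sym}{seen[sym]}{{Σ*}}")
        else:
            factors.append(sym)                 # terminals are copied unchanged
    # Only variables that occur at least twice need a selection.
    selections = [
        [f"{x}{i}" for i in range(1, k + 1)] for x, k in seen.items() if k >= 2
    ]
    return selections, factors


selections, factors = pattern_to_spanner("xayybxzx", {"x", "y", "z"})
```

For the pattern of this example, `selections` contains the groups x1, x2, x3 and y1, y2, and joining `factors` with ⋅ gives the regex formula γ.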

While the construction in the proof of Theorem 3.3 is so simple that it might not seem noteworthy, it will prove quite useful: In contrast to the simple definition of pattern languages, many canonical decision problems for them are surprisingly hard. Via Theorem 3.3, the corresponding lower bounds also apply to spanners, as we discuss in Sections 4.1 and 4.2.

3.2 Word Equations and Existential Concatenation Formulas

In this section, we introduce word equations, which are equations of patterns (cf. Definition 3.1) and can be used to define languages and relations, cf. Karhumäki et al. [26]:

Definition 3.5

A word equation is a pair η := (ηL, ηR) of patterns ηL and ηR. A pattern substitution σ is a solution of η if σ(ηL) = σ(ηR). We define Vars(η) := Vars(ηL) ∪ Vars(ηR). For k ≥ 1, a relation R ⊆ (Σ*)^k is defined by a word equation η := (ηL, ηR) if there exist variables x1, …, xk ∈ Vars(η) such that \(R=\left \{\left (\sigma (x_{1}),\ldots ,\sigma (x_{k})\right )\mid \text {\(\sigma \) is a solution of \(\eta \)}\right \}.\)

We also write (ηL, ηR) as ηL = ηR. As we shall see just after the next definition, both sides of the equation may have common variables. The following relations are well-known examples of relations that are definable by word equations:

Definition 3.6

Over Σ, we define relations
$$\begin{array}{@{}rcl@{}} R_{\text{com}}&:=&\{(x,y)\mid x,y\in\{u\}^{*} \text{ for some } u\in\Sigma^{*}\},\\ R_{\text{cyc}}&:=&\{(x,y)\mid x \text{ is a cyclic permutation of } y\}. \end{array} $$

As shown in Lothaire [30], the relation Rcom is defined by the equation xy = yx, and Rcyc is defined by the equation xz = zy.
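Both characterizations can be sanity-checked by exhaustive search over short words. The sketch below compares the definitions of Rcom and Rcyc with the two equations for all words up to a small length. It makes one simplifying assumption that is not stated in the text: for xz = zy, only witnesses z that are prefixes of x are tried, which suffices because x = uv and y = vu give the witness z = u. All function names are hypothetical.

```python
from itertools import product


def words(alphabet: str, max_len: int):
    """All words over alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def is_power_of(word: str, u: str) -> bool:
    """True if word = u^n for some n >= 0."""
    if word == "":
        return True
    return u != "" and len(word) % len(u) == 0 and word == u * (len(word) // len(u))


def in_r_com(x: str, y: str) -> bool:
    """Definition of R_com: x and y are powers of one common word u."""
    longer = x if len(x) >= len(y) else y
    candidates = [longer[:n] for n in range(len(longer) + 1)]
    return any(is_power_of(x, u) and is_power_of(y, u) for u in candidates)


def is_cyclic_permutation(x: str, y: str) -> bool:
    """Definition of R_cyc: y is a rotation of x."""
    return len(x) == len(y) and y in x + x


def solves_cyc_equation(x: str, y: str) -> bool:
    """Is there a z (assumed to be a prefix of x) with xz = zy?"""
    return any(x + x[:n] == x[:n] + y for n in range(len(x) + 1))


SAMPLE = list(words("ab", 4))
```

On this sample, `in_r_com(x, y)` agrees with `x + y == y + x`, and `is_cyclic_permutation(x, y)` agrees with `solves_cyc_equation(x, y)`, for every pair of words.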

Let R be a k-ary string relation, and let C be a class of spanners. We say that R is selectable by C if, for every spanner P ∈ C and every sequence of variables x = (x1, …, xk) with x1, …, xk ∈ SVars(P), the spanner \(\zeta ^{R}_{\mathbf {x}} P\) is also in C.

Proposition 3.7

The relations Rcom and Rcyc are selectable by core spanners.

Proof

Both parts of the proof use a technique from [7]. Let x = x1, ..., xk be a sequence of distinct span variables (k ≥ 1), and let X := {x1, …, xk}. The spanner \(\zeta ^{R}_{\mathbf {x}} {\Upsilon }_{X}\) is called the R-restricted universal spanner over x, and is denoted by \({\Upsilon }^{R}_{\mathbf {x}}\). According to Proposition 4.15 in [7], in order to show that a relation R is selectable by core spanners, it suffices to show that \({\Upsilon }^{R}_{\mathbf {x}}\) is a core spanner for every x ∈ SVars^k.

Rcyc: Note that for all x, y ∈ Σ*, the word x is a cyclic permutation of y (and vice versa) if and only if there exist u, v ∈ Σ* with x = uv and y = vu (see e. g. Lothaire [30]). Hence we can define the core spanner \(P_{\text {cyc}}:= \pi _{\{x,y\}} \hat {P}\), where
$$\hat{P}:=\zeta^=_{u_{1},u_{2}}\zeta^=_{v_{1},v_{2}} [{\kern-2.3pt}[ \alpha_{x}\times \alpha_{y} ]{\kern-2.3pt}], $$
and the regex formulas αx and αy are defined as
$$\begin{array}{@{}rcl@{}} \alpha_{x} &:=& \Sigma^{*} x\left\{u_{1}\{\Sigma^{*}\}\cdot v_{1}\{\Sigma^{*}\}\right\}\Sigma^{*},\\ \alpha_{y} &:=& \Sigma^{*} y\left\{v_{2}\{\Sigma^{*}\}\cdot u_{2}\{\Sigma^{*}\}\right\}\Sigma^{*}. \end{array} $$
In order to prove that \(P_{\text {cyc}}={\Upsilon }^{R_{\text {cyc}}}_{x,y}\), we first observe that, for every w ∈ Σ* and every μ ∈ Pcyc(w), there exists a \(\hat {\mu }\in \hat {P}(w)\) with \(\mu (x)=\hat {\mu }(x)\) and \(\mu (y)=\hat {\mu }(y)\). The selections enforce \(u:= w_{\hat {\mu }(u_{1})}=w_{\hat {\mu }(u_{2})}\) and \(v:= w_{\hat {\mu }(v_{1})}=w_{\hat {\mu }(v_{2})}\). Hence, \(w_{\mu (x)} = uv\) and \(w_{\mu (y)} = vu\), which means that \((w_{\mu (x)}, w_{\mu (y)})\in R_{\text {cyc}}\), and \(\mu \in {\Upsilon }^{R_{\text {cyc}}}_{x,y}(w)\). For the other direction, we can show analogously that every \(\mu \in {\Upsilon }^{R_{\text {cyc}}}_{x,y}(w)\) can be extended into a \(\hat {\mu }\in \hat {P}(w)\), which then proves μ ∈ Pcyc(w).

Rcom: This proof relies on another fact from combinatorics on words. For all x, y ∈ Σ*, the equation xy = yx holds if and only if (x, y) ∈ Rcom (again, see Lothaire [30]). We define a core spanner \(P_{\text {com}}:= \pi _{\{x,y\}}\hat {P}\), where
$$\hat{P}:=\zeta^=_{r_{1},r_{2},r_{3},r_{4}}\zeta^=_{x,x_{2}}\zeta^=_{y,y_{2}}\zeta^=_{\hat{x},\hat{x}_{2}}\zeta^=_{\hat{y},\hat{y}_{2}} [{\kern-2.3pt}[ \alpha_{1}\times \alpha_{2} \times \alpha_{3}\times \alpha_{4} ]{\kern-2.3pt}], $$
and the regex formulas α1, …, α4 are defined as
$$\begin{array}{@{}rcl@{}} \alpha_{1} &:=& \Sigma^{*} x\left\{\hat{x}\{\Sigma^{*}\}\cdot r_{1}\{\Sigma^{*}\}\right\}\Sigma^{*},\\ \alpha_{2} &:=& \Sigma^{*} x_{2}\left\{r_{2}\{\Sigma^{*}\}\cdot \hat{x}_{2}\{\Sigma^{*}\}\right\}\Sigma^{*},\\ \alpha_{3} &:=& \Sigma^{*} y\left\{\hat{y}\{\Sigma^{*}\}\cdot r_{3}\{\Sigma^{*}\}\right\}\Sigma^{*},\\ \alpha_{4} &:=& \Sigma^{*} y_{2}\left\{r_{4}\{\Sigma^{*}\}\cdot \hat{y}_{2}\{\Sigma^{*}\}\right\}\Sigma^{*}. \end{array} $$
In order to prove that \(P_{\text {com}}={\Upsilon }^{R_{\text {com}}}_{x,y}\), first assume that μ ∈ Pcom(w) for some w ∈ Σ*. Again, this means that there exists a \(\hat {\mu }\in \hat {P}(w)\) with \(\mu (x)=\hat {\mu }(x)\) and \(\mu (y)=\hat {\mu }(y)\). In a slight abuse of notation, we identify the variables \(x,\hat {x},y,\hat {y}\) with the corresponding subwords of w. In other words, we define \(x,\hat {x},y,\hat {y}\in \Sigma ^{*}\) by \(z := w_{\hat {\mu }(z)}\) for \(z\in \{x,\hat {x},y,\hat {y}\}\). Furthermore, let \(r=w_{\hat {\mu }(r_{1})}\). Due to the equality selections, we obtain the following word equations from α1 to α4:
$$\begin{array}{@{}rcl@{}} x &=& \hat{x} r = r \hat{x},\\ y &=& \hat{y} r = r \hat{y}. \end{array} $$
We explain this in detail for the first equation: First, note that due to the structure of α1, we know that \(w_{\hat {\mu }(x)} = w_{\hat {\mu }(\hat {x})}\cdot w_{\hat {\mu }(r_{1})}\) holds. Likewise, the structure of α2 ensures that \(w_{\hat {\mu }(x_{2})} = w_{\hat {\mu }(r_{2})}\cdot w_{\hat {\mu }(\hat {x}_{2})}.\) Due to the selections \(\zeta ^=_{r_{1},r_{2},r_{3},r_{4}}\), \(\zeta ^=_{x,x_{2}}\), and \(\zeta ^=_{\hat {x},\hat {x}_{2}}\), the latter can be expressed as \(w_{\hat {\mu }(x)} = w_{\hat {\mu }(r_{1})}\cdot w_{\hat {\mu }(\hat {x})},\) and by combining the two equations while abusing the notation as explained above, we obtain \(x=\hat {x}r=r\hat {x}\). The second equation is obtained analogously.

As \(\hat {x}r = r \hat {x}\), there exists a word u ∈ Σ* with \(r,\hat {x}\in \{u\}^{*}\). We choose the shortest u for which r ∈ {u}*. Then, due to \(\hat {y} r = r \hat {y}\), we have that \(\hat {y}\in \{u\}^{*}\) holds as well. This implies x, y ∈ {u}*, \((w_{\mu (x)}, w_{\mu (y)})\in R_{\text {com}}\), and \(\mu \in {\Upsilon }^{R_{\text {com}}}_{x,y}(w)\). Again we can show analogously that every \(\mu \in {\Upsilon }^{R_{\text {com}}}_{x,y}(w)\) can be extended into a \(\hat {\mu }\in \hat {P}(w)\), which then proves μ ∈ Pcom(w). □
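The construction for Rcyc can be simulated directly on short inputs. The sketch below is an illustration under assumptions, not the authors' implementation: spans are modelled as pairs (i, j) with 1 ≤ i ≤ j ≤ |w| + 1, and the hypothetical function `p_cyc` enumerates the decompositions that αx and αy allow before applying the two equality selections and the projection to {x, y}.

```python
def spans(w):
    """All spans [i, j> of w, with 1 <= i <= j <= len(w) + 1."""
    n = len(w)
    return [(i, j) for i in range(1, n + 2) for j in range(i, n + 2)]

def content(w, span):
    """The subword of w selected by a span [i, j>."""
    i, j = span
    return w[i - 1:j - 1]

def p_cyc(w):
    """Pairs (mu(x), mu(y)) obtained by projecting P-hat onto {x, y}."""
    result = set()
    for sx in spans(w):
        for sy in spans(w):
            x, y = content(w, sx), content(w, sy)
            # alpha_x splits x into u1 v1, alpha_y splits y into v2 u2;
            # the selections demand u1 = u2 and v1 = v2.
            for k in range(len(x) + 1):
                u1, v1 = x[:k], x[k:]
                for l in range(len(y) + 1):
                    v2, u2 = y[:l], y[l:]
                    if u1 == u2 and v1 == v2:
                        result.add((sx, sy))
    return result

def upsilon_cyc(w):
    """The R_cyc-restricted universal spanner over (x, y), by definition."""
    return {(sx, sy) for sx in spans(w) for sy in spans(w)
            if len(content(w, sx)) == len(content(w, sy))
            and content(w, sy) in content(w, sx) * 2}
```

For every input tried in the check below, both functions return the same set of span pairs, as the proof predicts.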

In particular, this means that we can add \(\zeta ^{R_{\text {com}}}\) and \(\zeta ^{R_{\text {cyc}}}\) to core spanner representations, without leaving the class \([{\kern -2.3pt}[\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}]{\kern -2.3pt}]\).

Example 3.8

Define \(L_{\text {imp}} := \{w^{n}\mid w\in \Sigma ^{+}, n\geq 2\}\) and \(\rho := \zeta ^{R_{\text {com}}}_{x,y}\left (x\{\Sigma ^{+}\}\cdot y\{\Sigma ^{+}\}\right )\). Then \(\mathcal {L}(\rho )=L_{\text {imp}}\).
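The claim of this example can be tested exhaustively on short words. The sketch assumes that ρ accepts w exactly when w splits into two non-empty factors x and y with (x, y) ∈ Rcom, and it uses the characterization xy = yx of Rcom; the function names are hypothetical.

```python
from itertools import product

def matches_rho(w):
    """True if w = x y with non-empty x, y and xy = yx (i.e. (x, y) in R_com)."""
    return any(w[:k] + w[k:] == w[k:] + w[:k] for k in range(1, len(w)))

def is_imprimitive(w):
    """True if w = u^n for some non-empty u and some n >= 2."""
    return any(len(w) % d == 0 and w == w[:d] * (len(w) // d)
               for d in range(1, len(w)))

def all_words(alphabet, max_len):
    """All words over `alphabet` of length at most `max_len`."""
    return ["".join(t) for n in range(max_len + 1)
            for t in product(alphabet, repeat=n)]
```

Both predicates agree on every word over {a, b} of length at most 6.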

This does not imply that Rcom can be used to select relations like \(R_{\text {pow}} := \{(x, x^{n})\mid n \geq 0\}\). For example, if x := abab, then (x, y) ∈ Rcom holds for all y ∈ {ab}*. The authors conjecture that Rpow is not selectable by core spanners.

Furthermore, the spanner that is constructed for Rcom in the proof of Proposition 3.7 is more complicated than the corresponding word equation xy = yx. In fact, we constructed both spanners not from the equations, but from a characterization of the solutions. This appears to be necessary, due to the fact that spanners need to relate their variables to an input w, while word equations use their variables without such restrictions. We shall see in Theorem 3.13 that, if this is kept in mind, core spanners can be used to simulate word equations.

Before we consider this topic further, we examine how word equations can simulate spanners, as this shall provide useful insights on some questions of static analysis in Section 4.2. One drawback of word equations is that they are unable to express many comparatively simple regular languages, like A* for any non-empty A ⊂ Σ (cf. Karhumäki et al. [26]). In order to overcome this problem, we consider the following extension:

Definition 3.9

Let η := (ηL, ηR) be a word equation. A regular constraints function is a function \({\mathcal {C}}\) that maps each x ∈ Vars(η) to a nondeterministic finite automaton \({\mathcal {C}}(x)\). A solution σ of η is a solution of η under constraints \({\mathcal {C}}\) if \(\sigma (x)\in \mathcal {L}({\mathcal {C}}(x))\) holds for every x ∈ Vars(η).

Hence, regular constraints restrict the possible substitutions of a variable x to a regular language \(\mathcal {L}(\mathcal {C}(x))\).
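Checking whether a given substitution is a solution under constraints is straightforward. The following sketch makes two simplifying assumptions that are not in the text: patterns are lists of symbols, and each constraint is given as a Python regular expression instead of an NFA. The names `apply_substitution` and `is_solution` are hypothetical.

```python
import re

def apply_substitution(pattern, sigma):
    """Replace every variable (a key of sigma) in the pattern; keep terminals."""
    return "".join(sigma.get(symbol, symbol) for symbol in pattern)

def is_solution(eta_l, eta_r, sigma, constraints=None):
    """True if sigma solves eta_l = eta_r and respects every regular constraint."""
    if apply_substitution(eta_l, sigma) != apply_substitution(eta_r, sigma):
        return False
    # Assumption: constraints maps a variable to a regex for L(C(x)).
    return all(re.fullmatch(regex, sigma[x]) is not None
               for x, regex in (constraints or {}).items())
```

For the equation xy = yx, the substitution σ(x) = abab, σ(y) = ab is a solution, but it is rejected once y is constrained to the language a*.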

A syntactic extension of word equations is EC, the existential theory of concatenation, which is obtained by extending word equations with ∨, ∧, and existential quantification over variables. For example, Rcyc is expressed by the EC-formula
$$\varphi_{\text{cyc}}(x,y):= \exists z\colon (xz=zy). $$
Using appropriate coding techniques, one can transform every EC-formula into an equivalent word equation (see Diekert [6]). Although the transformations given in [6] can result in an exponential blowup, satisfiability of word equations and of EC-formulas can still be decided in PSPACE.

Like word equations, these formulas can be further extended by adding regular constraints. For each variable x and each nondeterministic finite automaton (NFA) A, the (regular) constraint LA(x) is satisfied for a solution σ if \(\sigma (x)\in \mathcal {L}(A)\). We call the resulting class of formulas ECreg, the existential theory of concatenation with regular constraints.

Example 3.10

Let A be an NFA with \(\mathcal {L}(A)=\{\mathtt {a}\mathtt {b}^{i}\mathtt {a}\mid i\geq 1\}\), and define the ECreg-formula φ(x, y) := ∃z: (LA(z) ∧ (∃z1, z2: x = z1zz2) ∧ (∃z1, z2: y = z1zz2)).

Then φ expresses the relation of all (x, y) that have a common subword z from \(\mathcal {L}(A)\).
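The relation expressed by φ can be evaluated directly. In this sketch the existential quantifier over z is replaced by a search over the non-empty factors of x, and the NFA A is replaced by the Python regular expression `ab+a`; both choices are assumptions made for illustration.

```python
import re

def phi(x, y):
    """True if x and y share a factor z of the form a b^i a with i >= 1."""
    factors_of_x = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    return any(re.fullmatch("ab+a", z) is not None and z in y
               for z in factors_of_x)
```

For instance, `phi("babba", "abbab")` holds with z = abba, whereas `phi("aba", "abba")` fails because the only candidate factor of x, aba, does not occur in y.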

Note that we intentionally use LA(x) for constraint symbols instead of \({\mathcal {C}}\), to emphasize the following distinction in the use of constraints: In word equations, every variable x is constrained to one language \(L({\mathcal {C}}(x))\). In contrast to this, an ECreg-formula can use multiple constraint symbols for one variable (e. g., in the form of \(L_{A}(x)\land L_{A^{\prime }}(x)\)), or none at all.

Using the same techniques as for EC, one can transform ECreg-formulas into equivalent word equations with regular constraints. Again, the construction can result in an exponential blowup, but satisfiability of ECreg-formulas can still be decided in PSPACE (cf. Diekert [6]).

In order to simulate core spanners with ECreg-formulas, we introduce the following definition:

Definition 3.11

Let P be a core spanner with SVars(P) = {x1, …, xn}, n ≥ 0, and let \(\varphi (x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}})\) be an ECreg-formula. We say that φ realizes P if, for all \(w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}}\in \Sigma ^{*}\), we have that \(\varphi (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathtt {True}\) holds if and only if there is a μ ∈ P(w) with \({w^{P}_{k}} = w_{[1,i_{k}\rangle }\) and \({w^{C}_{k}} = w_{[i_{k},j_{k}\rangle }\) for each 1 ≤ k ≤ n, where [ik, jk〉 = μ(xk).

This definition uses the fact that spans are always defined in relation to a word w. Note that every span [i, j〉 ∈ Spans(w) is characterized by the words \(w_{[1,i\rangle }\) and \(w_{[i,j\rangle }\). Hence, if μ ∈ [[ρ]](w), the ECreg-formula models μ(xk) = [ik, jk〉 by mapping xw to w, \({x^{P}_{k}}\) to \(w_{[1,i_{k}\rangle }\), and \({x^{C}_{k}}\) to \(w_{[i_{k},j_{k}\rangle }\). In the naming of the variables, C stands for content, and P for prefix. This allows us to model spanners in ECreg-formulas:

Theorem 3.12

There is an algorithm that, given \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), computes in polynomial time an ECreg-formula φρ that realizes [[ρ]].

Proof

Before presenting the construction that is the main part of the proof, we briefly consider a technical detail of functional regex formulas. On an intuitive level, functional regex formulas guarantee that in each parse tree, every variable is assigned exactly once (hence, x{a} ⋅ x{a} is not functional). Consequently, it seems reasonable to conjecture that, if a functional regex formula contains a subformula of the form α1 ⋅ α2, then SVars(α1) ∩ SVars(α2) = ∅ must hold.

While this conjecture is true for regex formulas that do not contain ∅, it does not hold in general. For example, consider α := α1 ⋅ α2 with α1 := x{a} and α2 := (x{∅} ∨ b). Then x ∈ SVars(α1) ∩ SVars(α2), but as x{∅} can never be part of the label of a parse tree, the regex formula α is functional.

In order to exclude these fringe cases and simplify the construction of ECreg-formulas, we introduce the following concept: A regex formula α is ∅-reduced if α = ∅, or if α does not contain any occurrence of ∅. Using simple rewrite rules, we can observe the following.

Claim 1

There is an algorithm that, given a regex formula α, computes in polynomial time an ∅-reduced regex formula αR with [[αR]] = [[α]].

Proof

In order to compute αR, it suffices to rewrite α according to the following rewrite rules:
  1. ∅* → ε,
  2. \((\hat {\alpha }\mathbin {\vee }\emptyset )\to \hat {\alpha }\) and \((\emptyset \mathbin {\vee }\hat {\alpha })\to \hat {\alpha }\) for all regex formulas \(\hat {\alpha }\),
  3. \((\hat {\alpha }\cdot \emptyset )\to \emptyset \) and \((\emptyset \cdot \hat {\alpha })\to \emptyset \) for all regex formulas \(\hat {\alpha }\),
  4. x{∅} → ∅ for all variables x.

As ∅ is never part of a parse tree, we can observe that for all regex formulas α and β, where β is obtained from α by applying any number of these rewrite rules, [[β]] = [[α]] holds. Furthermore, one can use these rules to convert α into an equivalent and ∅-reduced αR in polynomial time: If α is stored in a tree structure, it suffices to apply all applicable rules in a bottom-up manner. \(\square \) (Claim 1)
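The bottom-up application of the four rewrite rules can be sketched as a single recursive pass over the syntax tree. The tuple-based tree encoding used here (`("empty",)`, `("eps",)`, `("sym", a)`, `("or", l, r)`, `("cat", l, r)`, `("star", a)`, `("bind", x, a)`) is a hypothetical representation chosen for the illustration.

```python
EMPTY = ("empty",)  # the regex formula for the empty language
EPS = ("eps",)      # the regex formula for the empty word

def reduce_empty(alpha):
    """Return an equivalent regex formula that is EMPTY or contains no EMPTY."""
    kind = alpha[0]
    if kind in ("empty", "eps", "sym"):
        return alpha
    if kind == "star":                       # rule 1: star of empty becomes eps
        inner = reduce_empty(alpha[1])
        return EPS if inner == EMPTY else ("star", inner)
    if kind == "bind":                       # rule 4: x{empty} becomes empty
        inner = reduce_empty(alpha[2])
        return EMPTY if inner == EMPTY else ("bind", alpha[1], inner)
    left, right = reduce_empty(alpha[1]), reduce_empty(alpha[2])
    if kind == "or":                         # rule 2: drop an empty disjunct
        if left == EMPTY:
            return right
        if right == EMPTY:
            return left
        return ("or", left, right)
    # kind == "cat", rule 3: an empty factor makes the whole product empty
    if EMPTY in (left, right):
        return EMPTY
    return ("cat", left, right)
```

Each node is visited once, and after its children are reduced at most one rule applies to it, so the pass is linear in the size of the tree. Applied to the fringe case x{a} ⋅ (x{∅} ∨ b), it returns the tree for x{a} ⋅ b.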

This allows us to proceed to the main part of the proof. Recall that our goal is a procedure that, given a \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with SVars(ρ) = {x1, …, xn}, constructs an ECreg-formula \(\varphi _{\rho }(x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}})\) such that for all \(w, {w^{P}_{1}}, {w^{C}_{1}},\ldots , {w^{P}_{n}}, {w^{C}_{n}}\in \Sigma ^{*}\), we have that \(\varphi _{\rho }(w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathtt {True}\) holds if and only if there is some μ ∈ P(w), where P := [[ρ]], with \({w^{P}_{k}} = w_{[1,i_{k}\rangle }\) and \({w^{C}_{k}} = w_{[i_{k},j_{k}\rangle }\) for each 1 ≤ k ≤ n, where [ik, jk〉 = μ(xk).

In fact, this μ is always uniquely defined by w, the \({w^{P}_{k}}\), and the \({w^{C}_{k}}\). Based on this, we introduce some notation that simplifies our reasoning. Given w ∈ Σ* and μ ∈ P(w), we define the (2n + 1)-tuple \(\mathbf {w}_{\mu }:= (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\) by \({w^{P}_{k}} := w_{[1,i_{k}\rangle }\) and \({w^{C}_{k}} := w_{[i_{k},j_{k}\rangle }\) as in the previous paragraph. For the other direction, we say that a (2n + 1)-tuple \(\mathbf {w}= (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\) over Σ* is spanner compatible if, for all 1 ≤ k ≤ n, the concatenated word \({w^{P}_{k}}\cdot {w^{C}_{k}}\) is a prefix of w. In this case, we define \(\mu _{\mathbf {w}}\) through \(\mu _{\mathbf {w}}(x_{k}) = [i_{k}, j_{k}\rangle \) with \(i_{k}:= |{w^{P}_{k}}|+1\) and \(j_{k}:=|{w^{P}_{k}} {w^{C}_{k}}|+1\) for 1 ≤ k ≤ n. Note that these are one-to-one conversions if w is fixed: Every μ defines its unique spanner compatible \(\mathbf {w}_{\mu }\), and every spanner compatible \(\mathbf {w}\) defines its unique \(\mu _{\mathbf {w}}\). We can now rephrase Definition 3.11 using this terminology, and observe that φρ realizes [[ρ]] if and only if the following two statements hold:
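  1. For all \(\mathbf {w}\in (\Sigma ^{*})^{2n+1}\), we have that \(\varphi _{\rho }(\mathbf {w}) = \mathtt {True}\) implies that \(\mathbf {w}\) is spanner compatible and \(\mu _{\mathbf {w}}\in P(w)\).
  2. If μ ∈ P(w), then \(\varphi _{\rho }(\mathbf {w}_{\mu }) = \mathtt {True}\).
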
We now proceed to the most complicated part of this proof, the construction of ECreg-formulas from regex formulas. (The following sub-proof is rather lengthy, as it contains the full induction for the correctness proof. The main part of the proof continues after the proof of Claim 2.)
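Since the remainder of the proof switches constantly between w-tuples and spanner compatible tuples, the two conversions may be worth making concrete first. This is a minimal sketch under the assumption that a w-tuple is a dictionary from variable names to spans (i, j), and that the encoded tuple is a dictionary from variable names to (prefix, content) pairs; the function names are hypothetical.

```python
def encode(w, mu):
    """From mu to the prefix/content encoding: x -> (w[1, i>, w[i, j>)."""
    return {x: (w[:i - 1], w[i - 1:j - 1]) for x, (i, j) in mu.items()}

def is_spanner_compatible(w, encoded):
    """Every prefix followed by its content must be a prefix of w."""
    return all(w.startswith(prefix + cont) for prefix, cont in encoded.values())

def decode(w, encoded):
    """From a spanner compatible encoding back to spans [i, j>."""
    if not is_spanner_compatible(w, encoded):
        raise ValueError("tuple is not spanner compatible")
    return {x: (len(prefix) + 1, len(prefix + cont) + 1)
            for x, (prefix, cont) in encoded.items()}
```

For a fixed w, `decode(w, encode(w, mu))` returns `mu` again, which reflects the one-to-one correspondence used in the proof.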

Claim 2

There is an algorithm that, given a functional regex formula ρ ∈ RGX, constructs in polynomial time an ECreg-formula φρ that realizes [[ρ]].

Proof

Due to Claim 1, we can assume without loss of generality that ρ is ∅-reduced. We define φρ recursively as follows:
  1.
    If ρ does not contain any variables (i. e., n = 0), ρ is a proper regular expression. Using canonical transformation techniques, we can construct in polynomial time a non-deterministic finite automaton A with \(\mathcal {L}(A)=\mathcal {L}(\rho )\), and we define
    $$\varphi_{\rho}(x_{w}):= L_{A}(x_{w}).$$
    Then φρ realizes [[ρ]], as φρ(w) = True holds if and only if \(w\in \mathcal {L}(A)=\mathcal {L}(\rho )\), which holds if and only if μw ∈ [[ρ]](w).
     
  2.
    If ρ contains variables, we assume that SVars(ρ) = {x1, …, xn} with n ≥ 1. By definition of regex formulas, no variable of ρ may occur inside of a Kleene star. Hence, we can distinguish three cases:
    (a)
      ρ = (ρ1 ∨ ρ2), where ρ1, ρ2 are functional regex formulas with SVars(ρ1) = SVars(ρ2) = SVars(ρ). We define
      $$\begin{array}{@{}rcl@{}} &&{}\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right):=\\ &&\qquad\left( \varphi_{\rho_{1}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\mathbin{\vee} \varphi_{\rho_{2}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\right). \end{array} $$
      The intuition behind this formula should be clear; we proceed directly to proving the correctness. Assume that \(\varphi _{\rho _{1}}\) and \(\varphi _{\rho _{2}}\) realize [[ρ1]] and [[ρ2]], respectively. We choose any w ∈ Σ*. To show the direction from logic to spanners, we extend w into a tuple \(\mathbf {w}\). By definition, \(\varphi _{\rho }(\mathbf {w}) = \mathtt {True}\) holds if and only if \(\varphi _{\rho _{i}}(\mathbf {w})=\mathtt {True}\) for an i ∈ {1, 2}. As \(\varphi _{\rho _{i}}\) realizes [[ρi]], the tuple \(\mathbf {w}\) is spanner compatible, and \(\mu _{\mathbf {w}}\) ∈ [[ρi]](w) holds. For the other direction, we proceed analogously: If μ ∈ [[ρi]](w), then \(\varphi _{\rho _{i}}(\mathbf {w}_{\mu })=\mathtt {True}\); hence, \(\varphi _{\rho }(\mathbf {w}_{\mu }) = \mathtt {True}\). We conclude that φρ realizes [[ρ]].
       
    (b)
      ρ = ρ1 ⋅ ρ2, where ρ1, ρ2 are functional regex formulas with SVars(ρ1) ∪ SVars(ρ2) = SVars(ρ) and SVars(ρ1) ∩ SVars(ρ2) = ∅. Without loss of generality, we can assume
      $$\begin{array}{@{}rcl@{}} {\textsf{SVars}\left( \rho_{1}\right)}&=&\{x_{1},\ldots,x_{m}\},\\ {\textsf{SVars}\left( \rho_{2}\right)}&=&\{x_{m+1},\ldots,x_{n}\} \end{array} $$
      with 0 ≤ m ≤ n. We define
      $$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\exists y_{1}, y_{2}, z^{P}_{m+1}, \ldots, {z^{P}_{n}}\colon \varphi_{I}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}},y_{1},y_{2},z^{P}_{m+1}, \ldots, {z^{P}_{n}}\right), \end{array} $$
      where
      $$\begin{array}{@{}rcl@{}} &&\varphi_{I}(x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}},y_{1},y_{2},z^{P}_{m+1}, \ldots, {z^{P}_{n}}):= \\ &&\qquad\qquad\qquad\left( {\vphantom{\underset{m+1\leq i \leq n}{\bigwedge}}}(x_{w} = y_{1}\cdot y_{2}) \wedge \varphi_{\rho_{1}}\left( y_{1},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{m}}, {x^{C}_{m}}\right)\right.\\ &&\left.\qquad\qquad\wedge \varphi_{\rho_{2}}\left( y_{2},z^{P}_{m+1}, x^{C}_{m+1}, \ldots, {z^{P}_{n}}, {x^{C}_{n}}\right) \!\wedge \underset{m+1\leq i \leq n}{\bigwedge} \left( {x^{P}_{i}} = y_{1} \cdot {z^{P}_{i}}\right)\right). \end{array} $$
      The idea behind this formula is as follows: As ρ = ρ1 ⋅ ρ2, whenever [[ρ]](w) ≠ ∅ holds, w can be decomposed into w = w1w2, where w1 is parsed in ρ1, and w2 in ρ2. We store these words in the variables y1 and y2, respectively. For all variables in SVars(ρ1), the spans of the μ ∈ [[ρ1]](w1) are also spans in w (as w1 is a prefix of w). Hence, we can use the results from ρ1 unchanged. On the other hand, [[ρ2]](w2) determines spans in relation to w2. Hence, each span [i, j〉 ∈ Spans(w2) corresponds to the span [i + c, j + c〉 ∈ Spans(w), where c := |w1|. The variables \({z^{P}_{i}}\) represent the start of the span with respect to y2, and the conjunction of the equations \(({x^{P}_{i}} = y_{1} \cdot {z^{P}_{i}})\) converts these starts into spans with respect to xw.
      The correctness proof is a little lengthy, but straightforward. Assume that \(\varphi _{\rho _{1}}\) and \(\varphi _{\rho _{2}}\) realize [[ρ1]] and [[ρ2]]. Assume that φρ(w) = True for some tuple \(\mathbf {w}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). By definition of φρ, the tuple w can be extended into \(\mathbf {w}^{\prime }=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}},u_{1},u_{2},v^{P}_{m+1},\ldots ,{v^{P}_{n}})\) with φI(w′) = True. By observing the structure of φI, we obtain:
      i. w = u1u2,
      ii. \({w^{P}_{i}} = u_{1} \cdot {v^{P}_{i}}\) for m + 1 ≤ i ≤ n,
      iii. \(\varphi _{\rho _{1}}(\mathbf {u}_{1})=\mathtt {True}\) and \(\varphi _{\rho _{2}}(\mathbf {u}_{2})=\mathtt {True}\), where
        $$\begin{array}{@{}rcl@{}} \mathbf{u}_{1} &:=& \left( u_{1}, {w^{P}_{1}}, {w^{C}_{1}}, \ldots, {w^{P}_{m}}, {w^{C}_{m}}\right),\\ \mathbf{u}_{2} &:=& \left( u_{2}, v^{P}_{m+1}, w^{C}_{m+1}, \ldots, {v^{P}_{n}}, {w^{C}_{n}}\right). \end{array} $$

      From this and our initial assumption, we can conclude that \(\mathbf {w}\) is spanner compatible, and that \(\mu _{\mathbf {u}_{1}}\in [{\kern -2.3pt}[ \rho _{1} ]{\kern -2.3pt}](u_{1})\) and \(\mu _{\mathbf {u}_{2}}\in [{\kern -2.3pt}[ \rho _{2} ]{\kern -2.3pt}](u_{2})\) must hold; we abbreviate these two tuples as μ1 and μ2. Thus, there exist corresponding parse trees T1 and T2 with respective root labels (u1, ρ1) and (u2, ρ2). We combine these into a new parse tree T by adding a new root node (w, ρ1 ⋅ ρ2) that has T1 as left and T2 as right child. As described in Definition 2.11, this tree T defines the w-tuple
      $$\mu^{T}(x_{k})\,=\,\left\{\begin{array}{ll}\left[\right.i_{k},j_{k}\rangle & \text{ if } 1\leq k\leq m \text{ and } \mu_{1}(x_{k})=\left[\right.i_{k},j_{k}\rangle,\\ \left[\right.i_{k}\,+\,|u_{1}|,j_{k}+|u_{1}|\rangle & \text{ if } m+1\leq k\leq n \text{ and } \mu_{2}(x_{k})=\left[\right.i_{k},j_{k}\rangle. \end{array}\right. $$
      In other words, for the variables x1 to xm, the w-tuple μT simulates μ1 in u1, the left part of w; and for the variables xm+1 to xn, it simulates μ2 in u2, the right part of w. Hence, all spans for the latter variables are shifted by |u1|. Using the equalities \({w^{P}_{i}} = u_{1} \cdot {v^{P}_{i}}\) from above, we obtain \(\mu ^{T} = \mu _{\mathbf {w}}\), which concludes this direction of the correctness proof. The other direction proceeds analogously: Given μ ∈ [[ρ]](w), we can use the corresponding parse tree T to factorize w into u1 and u2. We then shift the spans of the variables xm+1 to xn by |u1|, and use this to obtain \(\mathbf {u}_{2}\) with \(\varphi _{\rho _{2}}(\mathbf {u}_{2})=\mathtt {True}\). No effort is necessary for \(\mathbf {u}_{1}\), and we can then combine \(\mathbf {u}_{1}\) and \(\mathbf {u}_{2}\) into a tuple \(\mathbf {w}\) with \(\varphi _{\rho }(\mathbf {w}) = \mathtt {True}\) and \(\mathbf {w} = \mathbf {w}_{\mu }\). Thus, φρ realizes [[ρ]].
       
    (c)
      \(\rho = x\{\hat {\rho }\}\) for some x ∈ {x1, …, xn}, and \(\hat {\rho }\) is a functional regex formula with \(\textsf{SVars}(\hat {\rho }) = \textsf{SVars}(\rho )\setminus \{x\}\). Without loss of generality, let x = x1. We define
      $$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}(x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}) :=\\ &&\qquad\qquad\quad\left( \left( {x^{P}_{1}}=\varepsilon\right) \!\wedge\! \left( {x^{C}_{1}} = x_{w}\right) \!\wedge \varphi_{\hat{\rho}}\left( x_{w},{x^{P}_{2}}, {x^{C}_{2}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\right). \end{array} $$
      The formula uses the fact that in this case, for each μ ∈ [[ρ]](w), we have that μ(x1) = [1,|w| + 1〉 must hold. This is encoded by \({x^{P}_{1}}=\varepsilon \) and \({x^{C}_{1}} = w\). For the correctness proof, assume that \(\varphi _{\hat {\rho }}\) realizes \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\). Going from logic to spanners, assume that \(\mathbf {w}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\) and φρ(w) = True. Due to the structure of the formula, we know that \({w^{P}_{1}} =\varepsilon \), \({w^{C}_{1}}=w\), and \(\varphi _{\hat {\rho }}(\hat {\mathbf {w}})=\mathtt {True}\) for \(\hat {\mathbf {w}}=(w,{w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). As \(\varphi _{\hat {\rho }}\) realizes \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\), we know that \(\hat {\mathbf {w}}\) is spanner compatible, and \(\mu _{\hat {\mathbf {w}}}\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)\). Due to this and the definition of ρ, we observe μ ∈ [[ρ]](w) for the w-tuple
      $$\mu(x_{k}):= \left\{\begin{array}{ll} \left[\right.1,|w|+1\rangle & \text{if } k=1,\\ \mu_{\hat{\mathbf{w}}}(x_{k}) & \text{if } k>1. \end{array}\right. $$
      As μ = μw, we conclude this direction of the proof. For the other direction, let μ ∈ [[ρ]](w). By definition, μ(x1) = [1,|w| + 1〉 and \(\hat {\mu }\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\) for \(\hat {\mu }=\mu |_{\{x_{2},\ldots ,x_{n}\}}\). Due to our initial assumption, \(\varphi _{\hat {\rho }}(\mathbf {w}_{\hat {\mu }})=\mathtt {True}\) must hold. Note that \(\mathbf {w}_{\hat {\mu }}=(w,{w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\), and let \(\mathbf {w}:= (w,\varepsilon , w, {w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). Then φρ(w) = True; and as w = wμ, this concludes this direction. Thus, φρ realizes [[ρ]].
       
     
Finally, note that the size of φρ is polynomial in the size of ρ. More importantly, the construction of φρ follows the syntax of ρ, and does not require expensive additional computations. Hence, φρ can be computed in polynomial time. \(\square \) (Claim 2)
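The recursion of Claim 2 can be illustrated by an evaluator that follows the same case distinction. Note that this sketch does not build the formula φρ; it directly computes the tuples that φρ would accept, in the prefix/content encoding. The tuple-based syntax trees and the use of Python regular expressions for variable-free subformulas are assumptions made for the illustration, and `realize` is a hypothetical name.

```python
import re

def realize(rho, w):
    """All tuples of [[rho]](w) as dicts: variable -> (prefix, content)."""
    kind = rho[0]
    if kind == "re":                 # case 1: no variables, only L_A(x_w)
        return [{}] if re.fullmatch(rho[1], w) is not None else []
    if kind == "or":                 # case 2(a): disjunction of both formulas
        return realize(rho[1], w) + realize(rho[2], w)
    if kind == "bind":               # case 2(c): x^P = empty word, x^C = x_w
        return [{rho[1]: ("", w), **m} for m in realize(rho[2], w)]
    # case 2(b): x_w = y1 y2, and x^P_i = y1 z^P_i for the variables of rho_2
    result = []
    for k in range(len(w) + 1):
        y1, y2 = w[:k], w[k:]
        for m1 in realize(rho[1], y1):
            for m2 in realize(rho[2], y2):
                shifted = {x: (y1 + zp, c) for x, (zp, c) in m2.items()}
                result.append({**m1, **shifted})
    return result
```

For ρ = Σ* ⋅ x{ab⁺a} ⋅ Σ* over Σ = {a, b} and w = babba, the only tuple maps x to the prefix b and the content abba, that is, to the span [2, 6〉. The same tuple may be returned more than once if a formula admits several parse trees for it.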
Using Claim 2, we have the conversion for RGX, the class of (functional) regex formulas. As the final step of the proof, we extend this to all core spanner representations (i. e., the full class \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\)). Consider an arbitrary core spanner representation \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with SVars(ρ) = {x1, …, xn}, n ≥ 0. We distinguish the following cases:
  1. ρ is a regex formula. This case is covered in Claim 2.

  2.
    \(\rho = \pi _{Y} \hat {\rho }\), with Y = SVars(ρ) and \({\textsf{SVars}\left (\hat {\rho }\right )}\supseteq {\textsf{SVars}\left (\rho \right )}\). Assume without loss of generality that \(\textsf{SVars}(\hat {\rho })=\{x_{1},\ldots ,x_{n+m}\}\) with m ≥ 0. We define
    $$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\qquad\qquad\exists x^{P}_{n+1},x^{C}_{n+1},\ldots,x^{P}_{n+m},x^{C}_{n+m}\colon \varphi_{\hat{\rho}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, x^{P}_{n+m}, x^{C}_{n+m}\right) \end{array} $$
    Regarding the correctness, assume that \(\varphi _{\hat {\rho }}\) realizes \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\). Hence, if \(\hat {\mu }\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)\), we have \(\varphi _{\hat {\rho }}(\mathbf {w}_{\hat {\mu }})=\mathtt {True}\). This means that for \(\mu := \hat {\mu }|_{Y}\), we know that φρ(wμ) = True holds as well. Likewise, if φρ(w) = True, there exists an extension \(\hat {\mathbf {w}}\) of w with \(\mu _{\hat {\mathbf {w}}}\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)\). As \(\hat {\mathbf {w}}\) is spanner compatible, so is w. Thus, we observe \(\mu _{\mathbf {w}}=\mu _{\hat {\mathbf {w}}}|_{Y}\) and μw ∈ [[ρ]](w). Hence, φρ realizes [[ρ]].
     
  3.
    \(\rho = \zeta ^=_{\mathbf {x}} \hat {\rho }\), with \(\mathbf {x}\in ({\textsf{SVars}\left (\hat {\rho }\right )})^{m}\), 2 ≤ m ≤ n, and \(\textsf{SVars}(\rho )=\textsf{SVars}(\hat {\rho })\). Assume without loss of generality that x = (x1, …, xm). We define
    $$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\qquad\qquad\qquad\qquad\quad\left( \varphi_{\hat{\rho}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) \!\wedge\! \underset{2\leq i \leq m}{\bigwedge} \left( {x^{C}_{1}} \,=\, {x^{C}_{i}}\right)\right). \end{array} $$
    Recall that \(\zeta ^=_{x_{i},x_{j}}\) only checks whether \(w_{\mu (x_{i})}=w_{\mu (x_{j})}\) holds, not whether μ(xi) = μ(xj). This is equivalent to checking whether \({x^{C}_{i}}={x^{C}_{j}}\) holds.

    We only prove the correctness for m = 2; the other cases proceed analogously (or by reducing them to this binary case). Assume that \(\varphi _{\hat {\rho }}\) realizes \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\). Let μ ∈ [[ρ]](w). Then \(w_{\mu (x_{1})}=w_{\mu (x_{2})}\) and \(\mu \in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)\) hold by definition. The latter implies \(\varphi _{\hat {\rho }}(\mathbf {w})=\mathtt {True}\). Together with the former and the structure of φρ, we conclude φρ(w) = True.

    For the other direction, let φρ(w) = True. By the structure of φρ, we know that \(\varphi _{\hat {\rho }}(\mathbf {w})=\mathtt {True}\) and \({w^{C}_{1}}={w^{C}_{2}}\). As \(\varphi _{\hat {\rho }}\) realizes \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\), we have that w is spanner compatible, and \(\mu _{\mathbf {w}}\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)\). Due to \({w^{C}_{1}}={w^{C}_{2}}\), this implies μw ∈ [[ρ]](w) and concludes the proof that φρ realizes [[ρ]].

     
  4.
    ρ = (ρ1 ∪ ρ2), with SVars(ρ1) = SVars(ρ2) = SVars(ρ). Let
    $$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\qquad\qquad\left( \varphi_{\rho_{1}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) \mathbin{\!\vee\!} \varphi_{\rho_{2}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\right). \end{array} $$
    In this case, the construction and the correctness proof are identical to case 2a (disjunction) in the proof of Claim 2.
     
  5.
    ρ = (ρ1 ⋈ ρ2) with SVars(ρ) = SVars(ρ1) ∪ SVars(ρ2). We assume without loss of generality that SVars(ρ1) = {x1, …, xl} and SVars(ρ2) = {xm, …, xn} with 0 ≤ l ≤ n, 1 ≤ m ≤ n + 1, and m ≤ l + 1. Note that this implies SVars(ρ1) ∩ SVars(ρ2) = {xm, …, xl}, and SVars(ρ1) ∩ SVars(ρ2) = ∅ if and only if m = l + 1. We define
    $$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\qquad\qquad\left( \varphi_{\rho_{1}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{l}}, {x^{C}_{l}}\right) \!\wedge\! \varphi_{\rho_{2}}\left( x_{w},{x^{P}_{m}}, {x^{C}_{m}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\right). \end{array} $$
    The definition of ⋈ requires that μ ∈ [[ρ]](w) holds if and only if there are μ1 ∈ [[ρ1]](w) and μ2 ∈ [[ρ2]](w) with μ1(xi) = μ2(xi) for all i ∈ {m, …, l}. For each of these variables xi, we have that \(\varphi _{\rho _{1}}\) and \(\varphi _{\rho _{2}}\) model the span with the same variables \({x^{P}_{i}}\) and \({x^{C}_{i}}\).

    To prove the correctness, assume that \(\varphi _{\rho _{1}}\) and \(\varphi _{\rho _{2}}\) realize [[ρ1]] and [[ρ2]], respectively. Let μ ∈ [[ρ]](w). Then there exist μ1 ∈ [[ρ1]](w) and μ2 ∈ [[ρ2]](w) with \(\mu _{1}=\mu |_{\{x_{1},\ldots ,x_{l}\}}\) and \(\mu _{2}=\mu |_{\{x_{m},\ldots ,x_{n}\}}\), which implies μ1(xk) = μ2(xk) for m ≤ k ≤ l. Now, in order to talk about the components of \( \mathbf {w}_{\mu _{1}}\) and \( \mathbf {w}_{\mu _{2}}\), we name the components of the tuples as \(\mathbf {w}_{\mu _{1}}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{l}},{w^{C}_{l}})\) and \(\mathbf {w}_{\mu _{2}}=(w,{w^{P}_{m}},{w^{C}_{m}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). As μ1 and μ2 agree on their common variables, we can combine this to \(\mathbf {w}:= (w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathbf {w}_{\mu }\). As each \(\varphi _{\rho _{i}}\) realizes [[ρi]], we know that \(\varphi _{\rho _{i}}(\mathbf {w}_{\mu _{i}})=\mathtt {True}\). Hence, φρ(wμ) = φρ(w) = True. This concludes this direction.

    For the other direction, assume that φρ(w) = True. Due to the structure of the formula, this implies \(\varphi _{\rho _{i}}(\mathbf {w}_{i})=\mathtt {True}\), where \(\mathbf {w}_{1}:= (w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{l}},{w^{C}_{l}})\) and \(\mathbf {w}_{2}:= (w,{w^{P}_{m}},{w^{C}_{m}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). As \(\varphi _{\rho _{i}}\) realizes [[ρi]], we know that wi is spanner compatible, and \(\mu _{\mathbf {w}_{i}}\in [{\kern -2.3pt}[ \rho _{i} ]{\kern -2.3pt}](w)\). Due to the former, w is also spanner compatible. Due to the latter, we know that μw ∈ [[ρ]](w), as \(\mu _{\mathbf {w}}(x_{k})=\mu _{\mathbf {w}_{1}}(x_{k})=\mu _{\mathbf {w}_{2}}(x_{k})\) for all m ≤ k ≤ l. Hence, φρ realizes [[ρ]].

     
The formula φρ can be derived from ρ without requiring further computation, and its size is polynomial in the size of ρ. Hence, φρ can be constructed in polynomial time.
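The four algebraic cases translate into simple operations on encoded tuples. The following sketch assumes that the result of a spanner on a fixed input w is given as a list of dictionaries from variable names to (prefix, content) pairs; all function names are hypothetical.

```python
def project(tuples, keep):
    """Case 2: existentially quantified variables are simply dropped."""
    return [{x: m[x] for x in keep} for m in tuples]

def select_eq(tuples, xs):
    """Case 3: string equality compares only the contents, not the prefixes."""
    return [m for m in tuples if len({m[x][1] for x in xs}) == 1]

def union(tuples1, tuples2):
    """Case 4: disjunction of the two formulas."""
    return tuples1 + tuples2

def join(tuples1, tuples2):
    """Case 5: conjunction; shared variables must have the same prefix and content."""
    return [{**m1, **m2} for m1 in tuples1 for m2 in tuples2
            if all(m1[x] == m2[x] for x in m1.keys() & m2.keys())]
```

The selection keeps a tuple when the selected variables have equal contents even if their prefixes (and hence their spans) differ, which matches the remark on ζ= in case 3.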

As we shall see in Section 4.2, this result allows us to find upper bounds on two problems from the static analysis of spanners. We now examine how spanners can simulate word equations (and, thereby, also ECreg-formulas). As discussed above, spanners need to relate their variables to an input word. Hence, we only state the following result, which is a weaker form of simulation than for the other direction:

Theorem 3.13

Every word equation η := (ηL, ηR) with regular constraints \({\mathcal {C}}\) can be converted effectively into a \(\rho \in \textsf{RGX}^{\{\zeta ^=,\times \}}\) with SVars(ρ) ⊇ Vars(η) such that for all w ∈ Σ*, there is a solution σ of η under constraints \(\mathcal {C}\) with w = σ(ηL) = σ(ηR) if and only if there is a μ ∈ [[ρ]](w) with \(\sigma (x) = w_{\mu (x)}\) for all x ∈ Vars(η).

Proof

As each of the two sides of a word equation is a pattern, we can transform those into regex formulas by using a slightly adapted version of the conversion procedure from the proof of Theorem 3.3. Only two changes are made. Firstly, instead of binding a variable x to Σ*, we respect the constraints by using a regular expression for the language \(\mathcal {L}({\mathcal {C}}(x))\). Secondly, in order to ensure SVars(ρ) ⊇ Vars(η), the first occurrence of a variable x is not represented by x1, but by x.

Assume that ηL = α1 ⋯ αm and ηR = αm+1 ⋯ αn with \(m,n\in \mathbb {N}\), m + 1 ≤ n, and α1, …, αn ∈ (Σ ∪ X). We construct regex formulas \(\hat {\eta }_{L}:= \hat {\alpha }_{1}{\cdots } \hat {\alpha }_{m}\) and \(\hat {\eta }_{R}:= \hat {\alpha }_{m+1}{\cdots } \hat {\alpha }_{n}\), where for each position i with 1 ≤ i ≤ n, we define \(\hat {\alpha }_{i}\) as follows:
  1. If αi is a terminal (i. e., there is an a ∈ Σ with αi = a), let \(\hat {\alpha }_{i}:= a\).
  2. If αi is a variable (i. e., there is an x ∈ X with αi = x), let γ be a regular expression with \(\mathcal {L}(\gamma )=\mathcal {L}(\mathcal {C}(x))\). Furthermore, let \(j := |\alpha _{1}{\cdots }\alpha _{i}|_{x}\).
    (a) If j = 1, define \(\hat {\alpha }_{i}:= x\{\gamma \}\).
    (b) If j ≥ 2, define \(\hat {\alpha }_{i}:= x_{j}\{\gamma \}\) (where xj ∈ SVars is a new variable).
This ensures that \({\textsf{SVars}\left (\hat {\eta }_{L}\right )}\) and \({\textsf{SVars}\left (\hat {\eta }_{R}\right )}\) are disjoint. We then construct a sequence S of string equality selections appropriately: For every x ∈ Vars(η) with \(k := |\eta _{L}\eta _{R}|_{x} \geq 2\), the sequence S includes a selection \(\zeta ^=_{x,x_{2},\ldots ,x_{k}}\).

Finally, we define \(\rho := S(\hat {\eta }_{L}\times \hat {\eta }_{R})\).

In order to prove that this construction is correct, we show that for all w ∈ Σ*, μ ∈ [[ρ]](w) holds if and only if there is a solution σ of η under constraints \({\mathcal {C}}\) with
  1. w = σ(ηL) = σ(ηR), and
  2. σ(x) = \(w_{\mu (x)}\) for all x ∈ Vars(η).
We begin with the if-direction. Assume that σ is a solution of η under constraints \({\mathcal {C}}\). Let w := σ(ηL) (which implies w = σ(ηR), as σ is a solution of η). We use this to define a w-tuple μ as follows: Due to our construction, each variable \(\hat {x}\in {\textsf{SVars}\left (\rho \right )}\) corresponds to a uniquely defined αi with αi = x. If 1 ≤ i ≤ m, then \(\hat {x}\) occurs in \(\hat {\eta _{L}}\), and if m + 1 ≤ i ≤ n, then \(\hat {x}\) occurs in \(\hat {\eta _{R}}\). We now define \(\mu (\hat {x}):= [l,r\rangle \), where the choice of l and r depends on this distinction:
  • If \(\hat {x}\) occurs in \(\hat {\eta _{L}}\), let \(l := |\sigma (\alpha _{1}{\cdots }\alpha _{i-1})| + 1\) and \(r := |\sigma (\alpha _{1}{\cdots }\alpha _{i})| + 1\),

  • If \(\hat {x}\) occurs in \(\hat {\eta _{R}}\), let \(l := |\sigma (\alpha _{m+1}{\cdots }\alpha _{i-1})| + 1\) and \(r := |\sigma (\alpha _{m+1}{\cdots }\alpha _{i})| + 1\).

Either way, we know that \(w_{\mu (\hat {x})}=\sigma (x)\) holds, which implies \(w_{\mu (\hat {x})}\in \mathcal {L}({\mathcal {C}}(x))\). Analogously, we can use σ to construct parse trees for \((w,\hat {\eta }_{L})\) and \((w,\hat {\eta }_{R})\). This allows us to conclude \(\mu \in [{\kern -2.3pt}[ \hat {\eta }_{L}\times \hat {\eta }_{R} ]{\kern -2.3pt}](w)\). Furthermore, for every selection \(\zeta ^=_{x,x_{2},\ldots ,x_{k}}\) in S, we know from the construction that x and all xi (2 ≤ i ≤ k) refer to the same x ∈ Vars(η), which means that \(w_{\mu (x)}=w_{\mu (x_{i})}=\sigma (x)\) holds. Hence, for each of these selections, \(\mu \in [{\kern -2.3pt}[ \hat {\eta }_{L}\times \hat {\eta }_{R} ]{\kern -2.3pt}](w)\) implies \(\mu \in [{\kern -2.3pt}[ \zeta ^=_{x,x_{2},\ldots ,x_{k}}(\hat {\eta }_{L}\times \hat {\eta }_{R}) ]{\kern -2.3pt}](w)\). Thus, \(\mu \in [{\kern -2.3pt}[ S(\hat {\eta }_{L}\times \hat {\eta }_{R}) ]{\kern -2.3pt}](w)\), which is equivalent to μ ∈ [[ρ]](w) and concludes this direction of the proof.
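The definition of the spans in this direction can be illustrated with a minimal Python sketch. It assumes that a side of the equation is given as a string of single-character symbols and that σ is a dictionary; the function name `spans_from_solution` is hypothetical and only serves this illustration. A span [l, r⟩ is returned as a pair of 1-based positions with r exclusive.

```python
def spans_from_solution(side, sigma, variables):
    """List (symbol, l, r) for every variable occurrence in one side.

    A span [l, r> uses 1-based positions with r exclusive, so the
    extracted word is w[l-1 : r-1] in Python slicing.
    """
    spans = []
    prefix_len = 0                       # |sigma(alpha_1 ... alpha_{i-1})|
    for symbol in side:
        # Terminals are mapped to themselves, variables to sigma(x).
        image = sigma[symbol] if symbol in variables else symbol
        l = prefix_len + 1
        r = prefix_len + len(image) + 1
        if symbol in variables:
            spans.append((symbol, l, r))
        prefix_len += len(image)
    return spans
```

For instance, with σ(x) = aab and σ(y) = aabaab, both sides of (xy, yx) evaluate to w = aabaabaab, and every returned span selects exactly the image of its variable in w.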

For the only if-direction, assume that μ ∈ [[ρ]](w). We now define a pattern substitution σ by σ(a) := a for all a ∈ Σ, and σ(x) := \(w_{\mu (x)}\) for all x ∈ Vars(η). By our construction, μ(x) is derived from x{γ}, where \(\mathcal {L}(\gamma )=\mathcal {L}({\mathcal {C}}(x))\) must hold, which means that \(w_{\mu (x)}\in \mathcal {L}({\mathcal {C}}(x))\), and hence \(\sigma (x)\in \mathcal {L}({\mathcal {C}}(x))\). All that remains to be shown is that σ(ηL) = σ(ηR) = w. In order to prove this, we first define \(\hat {w}_{L} = \hat {w}_{1}{\cdots } \hat {w}_{m}\) and \(\hat {w}_{R} = \hat {w}_{m+1}{\cdots } \hat {w}_{n}\), where the \(\hat {w}_{i}\) with 1 ≤ i ≤ n are defined as follows:
  1. If αi = a ∈ Σ, let \(\hat {w}_{i}:= a\). Then \(\hat {w}_{i}=\hat {\alpha }_{i}\) and \(\hat {w}_{i}=\sigma (\alpha _{i})\) hold by definition.
  2. If αi = x ∈ X, let \(j := |\alpha _{1}{\cdots }\alpha _{i}|_{x}\). We distinguish two cases.
    (a) If j = 1, let \(\hat {w}_{i} = w_{\mu (x)}\). Then \(\sigma (\alpha _{i})=\hat {w}_{i}\) holds by definition.
    (b) If j ≥ 2, let \(\hat {w}_{i} = w_{\mu (x_{j})}\). Observe that S contains the selection \(\zeta ^=_{x,x_{2},\ldots ,x_{k}}\). Hence, \(w_{\mu (x_{j})}=w_{\mu (x)}\) holds, which implies \(\sigma (\alpha _{i})=\hat {w}_{i}\).
Now note that the \(\hat {w}_{i}\) correspond to the labels of the parse trees that have root labels \((w,\hat {\eta }_{L})\) and \((w,\hat {\eta }_{R})\). Hence, \(\hat {w}_{L}=w\) and \(\hat {w}_{R}=w\) must hold. Furthermore, we have \(\hat {w}_{i}=\sigma (\alpha _{i})\) for all 1 ≤ i ≤ n. This allows us to conclude
$$\begin{array}{@{}rcl@{}} \sigma(\eta_{L}) &=& \sigma(\alpha_{1}{\cdots} \alpha_{m}){\kern40pt} \sigma(\eta_{R}) = \sigma(\alpha_{m+1}{\cdots} \alpha_{n})\\ &=& \hat{w}_{1}{\cdots} \hat{w}_{m}= \hat{w}_{L},{\kern47.5pt} =\hat{w}_{m+1}{\cdots} \hat{w}_{n}=\hat{w}_{R}. \end{array} $$
We observe σ(ηL) = σ(ηR) = w, which concludes this direction of the proof. □

While this form of simulation is weaker (as w has to be present), it still shows that the constructed spanner is satisfiable if and only if the word equation (with constraints) is satisfiable. Furthermore, the computed (V, w)-relations encode solutions of the equation.

Example 3.14

Let a, b ∈ Σ and define η := (xy, yx) with \(\mathcal {L}({\mathcal {C}}(x))=\mathcal {L}(\mathtt {aab^{+}})\) and \(\mathcal {L}({\mathcal {C}}(y))=\Sigma ^{+}\). The construction from the proof of Theorem 3.13 results in
$$\rho:= \zeta^=_{x,x_{2}}\zeta^=_{y,y_{2}} (\hat{\eta}_{L}\times \hat{\eta}_{R}),$$
where \(\hat {\eta }_{L}:= x\{\mathtt {aab^{+}}\}\cdot y\{\Sigma ^{+}\}\) and \(\hat {\eta }_{R}:= y_{2}\{\Sigma ^{+}\}\cdot x_{2}\{\mathtt {aab^{+}}\}.\)
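The conversion from the proof of Theorem 3.13 is mechanical, which the following Python sketch tries to make explicit. It is only an illustration under simplifying assumptions: symbols are single characters, constraints are given as plain text, and the function name `build_spanner` as well as the textual output format are hypothetical. Occurrences of a variable are counted across both sides, left side first, which mirrors the definition of j in the proof.

```python
from collections import Counter

def build_spanner(eta_l, eta_r, constraints, variables):
    """Return (hat_eta_L, hat_eta_R, selections) for a word equation.

    eta_l, eta_r: strings whose characters are terminals or variables.
    constraints:  maps each variable to a regular expression (as text).
    variables:    the set X of variable symbols (all else is a terminal).
    """
    seen = Counter()          # occurrences of each variable read so far

    def convert(side):
        parts = []
        for symbol in side:
            if symbol not in variables:
                parts.append(symbol)          # terminal: copied unchanged
                continue
            seen[symbol] += 1
            j = seen[symbol]
            # First occurrence keeps the name x, later ones become x2, x3, ...
            name = symbol if j == 1 else f"{symbol}{j}"
            parts.append(f"{name}{{{constraints[symbol]}}}")
        return "·".join(parts)

    hat_l = convert(eta_l)
    hat_r = convert(eta_r)
    # One string equality selection per variable that occurs at least twice.
    selections = [
        [x] + [f"{x}{j}" for j in range(2, k + 1)]
        for x, k in sorted(seen.items()) if k >= 2
    ]
    return hat_l, hat_r, selections
```

Applied to the equation (xy, yx) with the constraints of Example 3.14, the sketch returns the two regex formulas and the two selections stated in the example.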

The only reason that this construction is not necessarily possible in polynomial time is that regular constraints are specified with NFAs, while core spanners use regular expressions, which can lead to an exponential increase in the size.

There is a similar construction that does not use the join operator: By adding new variables z1, z2, we can construct
$$\hat{\rho}:= \zeta^=_{x,x_{2}}\zeta^=_{y,y_{2}}\zeta^=_{z_{1},z_{2}}(z_{1}\{\hat{\eta}_{L}\}z_{2}\{\hat{\eta}_{R}\}), $$
which behaves almost like ρ; the only difference is that the solution is encoded in w = σ(ηLηR), instead of σ(ηL).

3.3 Xregexes

As shown by Fagin et al. [7], there are languages that are recognized by xregexes, but not by core spanners. In order to prove this, [7] introduced the so-called “uniform-0-chunk”-language Luzc: Assuming 0, 1 ∈ Σ, Luzc is defined as the language of all w = s1 t s2 t ⋯ sn−1 t sn, where n > 0, s1, …, sn ∈ {1}+, and t ∈ {0}+. Then \(\mathcal {L}(\alpha _{\text {uzc}})=L_{\text {uzc}}\) holds for the xregex αuzc := 1+ ⋅ x{0*} ⋅ (1+ ⋅ &x)* ⋅ 1+, but no core spanner recognizes Luzc.

Considering that the syntax of regex formulas does not allow the use of variables inside a Kleene star (or plus), this inexpressibility result might be considered expected, as αuzc has an occurrence of &x inside a Kleene star. This raises the question whether xregexes that restrict variables in a similar manner can still recognize languages that core spanners cannot. In order to examine this question, we define the following subclass of xregexes:

Definition 3.15

An xregex α is variable star-free (short: vstar-free) if, for every β ∈ Sub(α) with β = γ*, no subexpression of γ is a variable binding or a variable reference. We denote the class of all vstar-free xregexes by vsfXR.

As we shall see in Theorem 3.21 below, every language that is recognized by a vstar-free xregex is also recognized by a core spanner. While this observation might be considered not very surprising, its proof needs to deal with some technicalities. In particular, one needs to deal with expressions like α := x{Σ*} ⋅ (&x ∨ &x&x). A conversion in the spirit of Theorem 3.3 would need to replace the &x with distinct variables and ensure equality with selections; but as the disjunction contains subexpressions with distinct numbers of occurrences of &x, we would not be able to ensure functionality of the resulting regex formula. We avoid these problems by working with the following syntactically restricted class of vstar-free xregexes:

Definition 3.16

An α ∈ vsfXR is an xregex path if, for every β ∈ Sub(α) with β = (γ1 ∨ γ2), no subexpression of γ1 or γ2 is a variable binding or a variable reference. We denote the class of all xregex paths by XRP.

Intuitively, an xregex path α ∈ XRP can be understood as a concatenation α = α1 ⋯ αn, where each αi is either a proper regular expression, a variable reference, or a variable binding of the form \(\alpha _{i} = x\{\hat {\alpha }\}\), where \(\hat {\alpha }\) is also an xregex path. By “multiplying out” disjunctions that contain variables, we can convert every vstar-free xregex into a disjunction of xregex paths.

Lemma 3.17

There is an algorithm that, given α ∈ vsfXR, computes α1, …, αn ∈ XRP with \(\mathcal {L}(\alpha )=\bigcup _{i=1}^{n}\mathcal {L}(\alpha _{i})\).

Proof

If a vstar-free xregex α is not an xregex path, there exists at least one x ∈ Vars(α) and at least one subexpression β ∈ Sub(α) with βα such that
  1. β is a disjunction; i. e., β = (γ1 ∨ γ2) for some γ1, γ2 ∈ vsfXR,
  2. β contains a variable binding x{⋯} or a variable reference &x.
We now rewrite α into two vstar-free xregexes α1 and α2, by replacing β with γ1 or γ2, respectively. We observe that this rewriting step does not change the language:

Claim 1

\(\mathcal {L}(\alpha )=\mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\)

Proof

If \(w\in \mathcal {L}(\alpha )\), there exists an α-parse tree T for w; in other words, the root of T is labelled with (w, α). Recall that α is vstar-free. Hence, we know that T uses the occurrence of β that was rewritten to create α1 and α2 at most once (in order to be able to use the occurrence multiple times, α would need to contain a star around β).

This allows us to distinguish two possibilities: If T does not use this occurrence of β at all, we can immediately transform T into an αi-parse tree Ti (i ∈ {1, 2}) by replacing the root label with (w, αi), and changing all children accordingly. Hence, \(w\in \mathcal {L}(\alpha _{i})\) holds.

On the other hand, if T uses this occurrence of β, then there exists a uniquely defined node v in T that is labeled with \((\hat {w},\beta )\) for some word \(\hat {w}\in \Sigma ^{*}\). Furthermore, this node corresponds to the occurrence of β that was rewritten in α1 and α2. By definition, v has exactly one child \(\hat {v}\), which is labeled with \((\hat {w},\gamma _{i})\) for some i ∈ {1, 2}. We rewrite T into an αi-parse tree Ti by removing v (i. e., \(\hat {v}\) replaces v), relabeling the root of T to (w, αi), and changing all labels between the root and \(\hat {v}\) accordingly. As Ti is an αi-parse tree for w, we have that \(w\in \mathcal {L}(\alpha _{i})\) holds. This proves \(\mathcal {L}(\alpha )\subseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\).

In order to prove \(\mathcal {L}(\alpha )\supseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\), we proceed analogously: If \(w\in \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\), we can transform an αi-parse tree for w into an α-parse tree by inserting a node \((\hat {w},\beta )\) (if necessary), and changing the labels accordingly. \(\square \) (Claim 1)

Note that this equivalence relies on the fact that α is vstar-free, which implies that β does not occur inside a Kleene star. For xregexes that are not vstar-free, we can only conclude \(\mathcal {L}(\alpha )\supseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\). This is easily seen considering the example of x{a}y{b}(&x ∨ &y)*, which would be rewritten to x{a}y{b}(&x)* and x{a}y{b}(&y)*.

We repeat this rewriting procedure on every created vstar-free xregex that is not an xregex path. This procedure terminates, as every rewriting removes a disjunction that contains at least one variable (binding or reference). Hence, if α contains \(k\in \mathbb {N}_{>0}\) disjunctions, this process results in xregex paths α1, …, αn for some \(n \leq 2^{k}\), and \(\mathcal {L}(\alpha )=\bigcup _{i=1}^{n}\mathcal {L}(\alpha _{i})\). □

Example 3.18

Let α := x{Σ*} ⋅ &x ⋅ (x{Σ*} ∨ y{Σ*}) ⋅ (&x ∨ &y) ⋅ &x. Multiplying out the disjunctions, we obtain the following xregex paths:
$$\begin{array}{@{}rcl@{}} \alpha_{1} &=& x\{\Sigma^{*}\}\cdot\&x\cdot x\{\Sigma^{*}\}\cdot \&x\cdot\&x,\\ \alpha_{2} &=& x\{\Sigma^{*}\}\cdot\&x\cdot x\{\Sigma^{*}\} \cdot \&y\cdot\&x,\\ \alpha_{3} &=& x\{\Sigma^{*}\}\cdot\&x\cdot y\{\Sigma^{*}\}\cdot \&x\cdot\&x,\\ \alpha_{4} &=& x\{\Sigma^{*}\}\cdot\&x\cdot y\{\Sigma^{*}\}\cdot \&y\cdot\&x. \end{array} $$
Then \(\mathcal {L}(\alpha )=\bigcup _{i=1}^{4}\mathcal {L}(\alpha _{i})\).
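The "multiplying out" step of Lemma 3.17 can be sketched in Python as follows. The tuple-based syntax tree (`"re"`, `"ref"`, `"bind"`, `"cat"`, `"or"`) and the function names are assumptions made for this illustration only, and the input is assumed to be vstar-free. Disjunctions without variables are left untouched, since they are proper regular expressions.

```python
from itertools import product

def has_var(node):
    """True if the subexpression contains a variable binding or reference."""
    kind = node[0]
    if kind in ("ref", "bind"):
        return True
    if kind == "re":
        return False
    if kind == "cat":
        return any(has_var(child) for child in node[1])
    return has_var(node[1]) or has_var(node[2])      # kind == "or"

def paths(node):
    """Expand a vstar-free xregex into a list of xregex paths."""
    kind = node[0]
    if kind in ("re", "ref") or not has_var(node):
        return [node]
    if kind == "bind":
        return [("bind", node[1], body) for body in paths(node[2])]
    if kind == "cat":
        choices = [paths(child) for child in node[1]]
        return [("cat", list(combo)) for combo in product(*choices)]
    return paths(node[1]) + paths(node[2])           # disjunction with variables

def show(node):
    """Render a syntax tree in the notation used in the text."""
    kind = node[0]
    if kind == "re":
        return node[1]
    if kind == "ref":
        return "&" + node[1]
    if kind == "bind":
        return node[1] + "{" + show(node[2]) + "}"
    if kind == "cat":
        return "·".join(show(child) for child in node[1])
    return "(" + show(node[1]) + "∨" + show(node[2]) + ")"
```

On the xregex α of Example 3.18, the sketch yields four paths in the order α1, …, α4 given above; in general, the number of paths can be exponential in the number of disjunctions.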

This transformation process might result in an exponential number of xregex paths; but as efficiency is not of concern right now, this is not a problem (the followup paper Freydenberger [13] shows that this blowup can be avoided with a more involved construction). Each of these xregex paths is then transformed into a functional regex formula:

Lemma 3.19

There is an algorithm that, given α ∈ XRP, computes \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=\}}\) with \(\mathcal {L}(\rho )=\mathcal {L}(\alpha )\).

Proof

Before we start with the proof, note that we can safely assume that α does not contain ∅: If ∅ occurs inside a Kleene star (or a disjunction), that Kleene star (or disjunction) cannot contain any variable bindings or references, as α is an xregex path. Hence, we can remove ∅ as in the proof of Theorem 3.12. All other occurrences of ∅ imply \(\mathcal {L}(\alpha )=\emptyset \); in this case, we are done.

Our goal is to rewrite the xregex path α into an equivalent core spanner of the form \(\pi _{\emptyset }S\delta \), where δ is a regex formula, and S is a sequence of string equality selections.

The main idea of the construction is quite straightforward: We basically replace each variable reference &x with a unique xi{Σ*}, and use a string equality \(\zeta ^=_{x,x_{i}}\) to connect xi with the appropriate binding. The only technical problem is that unlike regex formulas, xregexes allow variables to be bound multiple times. We solve this by using a unique variable for every occurrence of a variable binding in α.

As explained above, the xregex path α can be understood as a concatenation α = α1 ⋯ αn, where each αi is either a proper regular expression, a variable reference, or a variable binding of the form \(\alpha _{i} = x\{\hat {\alpha }\}\), where \(\hat {\alpha }\) is also an xregex path.

Now, if we choose any occurrence of a variable reference &x in α, exactly one of the following two cases applies:
  1. There is no binding x{⋯} in α that is to the left of that occurrence of &x, or
  2. there is a binding x{⋯} in α that is to the left of that occurrence of &x.
In the first case, this &x will always default to ε, which means that we can safely replace it with ε.

In the second case, we see that this &x will always refer to the variable binding x{⋯} that is closest to it to the left in α. In other words, we can simply read α from left to right. All &x before the first binding for x default to ε; and all &x after the first binding for x refer to the most recent binding for x (recall that, according to our definition of xregexes, no variable binding for a variable x may contain another binding of x).

This allows us to rewrite α into an xregex path γ with \(\mathcal {L}(\gamma )=\mathcal {L}(\alpha )\) such that no occurrence of a variable reference &x in γ refers to the default value ε, and every variable binding x{⋯} occurs at most once. This is done in the following way: We read α from left to right. If we encounter a reference &x for which no binding has been seen, we replace it with ε. If we encounter a binding x{⋯} that has already been seen before, we replace it with a binding for a new variable \(\hat {x}\), and all subsequent occurrences of &x are renamed to \(\&\hat {x}\). (Of course, further occurrences of x{⋯} would require further new variables.) For example, the xregex path
$$\alpha_{2} = x\{\Sigma^{*}\}\cdot\&x\cdot x\{\Sigma^{*}\} \cdot \&y\cdot\&x $$
from Example 3.18 would be rewritten to
$$\gamma_{2} = x\{\Sigma^{*}\}\cdot\&x\cdot \hat{x}\{\Sigma^{*}\} \cdot \varepsilon\cdot\&\hat{x}. $$
After rewriting α to γ, the next step is to transform γ into a regex formula δ by replacing all variable references in a manner that is similar to the proof of Theorem 3.3. More specifically, we construct δ by replacing, for each x ∈ Vars(γ), the i-th occurrence of &x in γ with xi{Σ*}. Note that δ is functional: Each variable in SVars(δ) appears exactly once in δ; and as δ is also an xregex path, this implies that every δ-parse tree contains every variable exactly once. (Recall that we assumed that α does not contain ∅; hence, neither do γ and δ.)

For every variable x for which there occur references &x in γ, we define a selection \(\zeta ^=_{V_{x}}\), where \(V_{x} := \{x\} \cup \{x_{i}\mid x_{i} \text { occurs in } \delta \}\). We let S denote a sequence of these selections (the order is irrelevant), and define the spanner representation \(\rho := \pi _{\emptyset }S\delta \). As we simulate the behavior of each variable binding x{⋯} and its references &x using the selection \(\zeta ^=_{V_{x}}\), it is easy to see that \(\mathcal {L}(\rho )=\mathcal {L}(\gamma )\) and, hence, \(\mathcal {L}(\rho )=\mathcal {L}(\alpha )\). □

Example 3.20

Consider the xregex path
$$\alpha:= \&x\cdot x\{\Sigma^{*}\cdot y\{\Sigma^{*}\}\}\cdot \&x\cdot \&y\cdot y\{\Sigma^{*}\}\cdot\&x\cdot\&y. $$
The construction from the proof of Lemma 3.19 leads to the equivalent xregex path
$$\gamma:= \varepsilon\cdot x\{\Sigma^{*}\cdot y\{\Sigma^{*}\}\}\cdot \&x\cdot \&y\cdot \hat{y}\{\Sigma^{*}\}\cdot\&x\cdot\&\hat{y}, $$
from which we derive the functional regex formula
$$\delta:= x\left\{\Sigma^{*} y\{\Sigma^{*}\}\right\} x_{1}\{\Sigma^{*}\} y_{1}\{\Sigma^{*}\} \hat{y}\{\Sigma^{*}\} x_{2}\{\Sigma^{*}\} \hat{y}_{1}\{\Sigma^{*}\}, $$
which we use in the spanner representation \(\rho := \pi _{\emptyset }\zeta ^=_{x,x_{1},x_{2}}\zeta ^=_{y,y_{1}}\zeta ^=_{\hat {y},\hat {y}_{1}} \delta .\) Then \(\mathcal {L}(\alpha )=\mathcal {L}(\rho )\).
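The two rewriting steps from the proof of Lemma 3.19 (from α to γ, and from γ to δ together with the selections) can be combined into one left-to-right pass. The following Python sketch is an illustration only: the list-based encoding of xregex paths, the function name `path_to_spanner`, and the naming scheme (a fresh variable for a repeated binding of x is written `x^`, and the i-th reference to x becomes `x_i`) are assumptions that are not part of the construction itself. The input is assumed to contain no ∅.

```python
from collections import defaultdict

def path_to_spanner(path):
    """Return (delta, selections) for an xregex path without the empty set.

    Items are ("re", text), ("ref", x), or ("bind", x, [items]).
    """
    current = {}                     # source variable -> name of latest binding
    times_bound = defaultdict(int)   # how often each source variable was bound
    ref_count = defaultdict(int)     # references seen per (renamed) variable

    def convert(items):
        out = []
        for item in items:
            kind = item[0]
            if kind == "re":
                out.append(item[1])
            elif kind == "ref":
                name = current.get(item[1])
                if name is None:
                    continue                 # unbound reference defaults to epsilon
                ref_count[name] += 1
                out.append(f"{name}_{ref_count[name]}{{Σ*}}")
            else:                            # kind == "bind"
                body = convert(item[2])      # inner references see older bindings
                times_bound[item[1]] += 1
                name = item[1] + "^" * (times_bound[item[1]] - 1)
                current[item[1]] = name
                out.append(f"{name}{{{body}}}")
        return " ".join(out)

    delta = convert(path)
    # One selection V_x per variable that has at least one reference.
    selections = [
        [name] + [f"{name}_{i}" for i in range(1, n + 1)]
        for name, n in ref_count.items()
    ]
    return delta, selections
```

On the xregex path of Example 3.20, the sketch produces the regex formula δ and the three selections of ρ, up to the naming of the fresh variables.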

As these spanner representations are Boolean, they are also union compatible. Hence, we can now combine Lemma 3.17 and Lemma 3.19 to observe the following.

Theorem 3.21

There is an algorithm that, given α ∈ vsfXR, computes \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}\) with \(\mathcal {L}(\rho )=\mathcal {L}(\alpha )\).

In Section 4.2, we use Theorem 3.21 together with the undecidability results from [12] to obtain multiple lower bounds for static analysis problems. Theorem 3.21 also raises the question whether every language that is recognized by a core spanner is also recognized by a vstar-free xregex. As we have already seen in Example 3.8, it is possible to express the language
$$L_{\text{imp}}:=\{w^{n}\mid w\in\Sigma^{+}, n\geq 2\} $$
with core spanners. Hence, under certain conditions, core spanners can simulate constructions like (&x)*.

While Limp might seem to be an obvious witness that separates the classes of languages that are recognized by core spanners and by vstar-free xregexes, proving this appears to be quite involved. Instead, we consider a related language, which allows us to use the following tool:

Definition 3.22

Let \(k\in \mathbb {N}_{>0}\). We call a set \(A \subseteq \mathbb {N}^{k}\) linear if there exist an r ≥ 0 and \(m_{0},\ldots ,m_{r}\in \mathbb {N}^{k}\) with \(A=\{m_{0}+ m_{1} i_{1} + m_{2} i_{2} + {\cdots } + m_{r} i_{r} \mid i_{1},i_{2},\ldots ,i_{r}\in \mathbb {N}\}\). A set \(A \subseteq \mathbb {N}^{k}\) is semi-linear if it is a finite union of linear sets. Assume Σ = {a1, a2, …, ak} with |Σ| = k. The Parikh map \(\Psi \colon \Sigma ^{*}\to \mathbb {N}^{k}\) is defined by \(\Psi (w):= (|w|_{a_{1}},|w|_{a_{2}},\ldots ,|w|_{a_{k}})\), and is extended to languages by Ψ(L) := {Ψ(w) ∣ w ∈ L}. We call L semi-linear if Ψ(L) is semi-linear.
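As a small illustration of these notions, the following Python sketch computes Parikh vectors and enumerates a finite part of a linear set. The function names and the cut-off parameter `bound` are assumptions made for this sketch; a linear set is infinite in general, so only the points with all coefficients below `bound` are produced.

```python
from itertools import product

def parikh(word, alphabet):
    """Parikh vector of `word` with respect to the ordered alphabet."""
    return tuple(word.count(letter) for letter in alphabet)

def linear_set(m0, periods, bound):
    """Finite part of {m0 + sum_j m_j * i_j} with every i_j in range(bound)."""
    points = set()
    for coeffs in product(range(bound), repeat=len(periods)):
        point = tuple(
            base + sum(c * period[d] for c, period in zip(coeffs, periods))
            for d, base in enumerate(m0)
        )
        points.add(point)
    return points
```

For example, the Parikh image of the regular language (ab)* is the linear set with m0 = (0, 0) and the single period m1 = (1, 1).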

According to Parikh’s Theorem [32], every context-free language is semi-linear. Moreover, as shown by Ginsburg and Spanier [19], a set is semi-linear if and only if it is definable in Presburger arithmetic. Building on this, we state the following.

Theorem 3.23

For everyαvsfXR, the language\(\mathcal {L}(\alpha )\)is semi-linear.

Proof

In order to increase the readability, we prove the claim for the case |Σ| = 2 (the adaptation to larger alphabets is obvious). We assume Σ = {a, b} and define Ψ(a) := (1, 0) and Ψ(b) := (0, 1). Assume that Vars(α) = {x1, …, xk} for some \(k\in \mathbb {N}_{>0}\).

It suffices to prove the claim for α ∈ XRP, as semi-linear sets are closed under union, and (according to Lemma 3.17) every vstar-free xregex is equivalent to a finite union of xregex paths.

As explained in the proof of Lemma 3.19 (in the construction of γ), we can also assume without loss of generality that every variable binding x{⋯} occurs exactly once in α, and that no variable reference &xi uses the default binding ε. In particular, this means that in every α-parse tree, each variable xi stores exactly one word wi.

Let α be an xregex path that satisfies these conditions. Our goal is to construct a Presburger formula φ such that \(\varphi(n^{\mathtt {a}},n^{\mathtt {b}})\) is true if and only if \((n^{\mathtt {a}},n^{\mathtt {b}})\in \Psi (\mathcal {L}(\alpha ))\). This formula will use variables \(x^{\mathtt {a}}_{i}\) and \(x^{\mathtt {b}}_{i}\) to represent \(|w_{i}|_{\mathtt{a}}\) and \(|w_{i}|_{\mathtt{b}}\), respectively. Recall that, due to our initial assumptions, each reference &xi refers to the same word wi; hence, we can safely define the corresponding variables \(x^{\mathtt {a}}_{i}\) and \(x^{\mathtt {b}}_{i}\) “globally” in φ.

Let I⊆{1, …, k}. We use x and xI as abbreviations for the sequences \(x^{\mathtt {a}}_{1},x^{\mathtt {b}}_{1}, \ldots \), \(x^{\mathtt {a}}_{k},x^{\mathtt {b}}_{k}\) and \(\left (x^{\mathtt {a}}_{i},x^{\mathtt {b}}_{i} : i \in I\right )\), and define
$$\varphi(n^{\mathtt{a}},n^{\mathtt{b}}):= \exists \mathbf{x}\colon \varphi_{\alpha}(n^{\mathtt{a}},n^{\mathtt{b}},\mathbf{x}), $$
where φα with Vars(α) = {x1, …, xk} is constructed according to the following general procedure.

Given an xregex path γ, we define a Presburger formula φγ as follows: First, as γ is an xregex path, there is a decomposition γ = γ1γ2 ⋯ γl (\(l\in \mathbb {N}_{>0}\)), where each γi is either a proper regular expression, a variable reference, or a variable binding of the form \(x\{\hat {\gamma _{i}}\}\) such that \(\hat {\gamma _{i}}\) is also an xregex path. For each γi, we use variables \(n^{\mathtt {a}}_{i}\) and \(n^{\mathtt {b}}_{i}\) to denote the number of occurrences of a and b in the subword that is generated by γi.

We denote the set of all variables that are bound or referenced in γi by
$${\textsf{VarsBR}\left( \gamma_{i}\right)} := {\textsf{Vars}\left( \gamma_{i}\right)} \cup \{x \mid \&x \text{ occurs in }\gamma_{i}\}. $$
In a slight abuse of notation, we identify \(\mathbf {x}_{\textsf{VarsBR}(\gamma _{i})}\) with the sequence (\(x^{\mathtt{a}}\), \(x^{\mathtt{b}}\): x ∈ VarsBR(γi)).
Keeping this in mind, we define
$$\begin{array}{@{}rcl@{}} &&\varphi_{\gamma}\left( n^{\mathtt{a}},n^{\mathtt{b}},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma\right)}}\right):= \exists n^{\mathtt{a}}_{1}, n^{\mathtt{b}}_{1}, {\ldots} n^{\mathtt{a}}_{l}, n^{\mathtt{b}}_{l}\colon\\ &&\,\,\,\,\left( \left( n^{\mathtt{a}} = n^{\mathtt{a}}_{1} + {\cdots} + n^{\mathtt{a}}_{l}\right) \!\wedge\! \left( n^{\mathtt{b}} = n^{\mathtt{b}}_{1} + {\cdots} + n^{\mathtt{b}}_{l}\right) \!\wedge\! \bigwedge\limits_{i=1}^{l} \varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i}, n^{\mathtt{b}}_{i},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma_{i}\right)}}\right)\right), \end{array} $$
where the Presburger formulas are defined as follows:
  • If γi is a proper regular expression, then \(\mathcal {L}(\gamma _{i})\) is semi-linear (as a consequence of Parikh’s theorem [32], every regular language is semi-linear). Hence, due to Ginsburg and Spanier [19], there is a Presburger formula \(\hat {\varphi }_{\gamma _{i}}\) such that \(\hat {\varphi }_{\gamma _{i}}(n^{\mathtt {a}}, n^{\mathtt {b}})\) is true if and only if \((n^{\mathtt {a}}, n^{\mathtt {b}})\in \Psi (\mathcal {L}(\gamma _{i}))\). We define
    $$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{\textsf{VarsBR}(\gamma_{i})}\right):= \hat{\varphi}_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i}\right). $$
    In order to avoid potential confusion, note that in this case \(\mathbf {x}_{\textsf{VarsBR}(\gamma _{i})}\) is the empty sequence. This is due to the fact that γi is a proper regular expression, which implies VarsBR(γi) = ∅.
  • If γi = &xj for some 1 ≤ j ≤ k, we define
    $$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma_{i}\right)}}\right):= \left( n^{\mathtt{a}}_{i}=x^{\mathtt{a}}_{j}\right) \wedge \left( n^{\mathtt{b}}_{i}=x^{\mathtt{b}}_{j}\right). $$
  • If γi = xj{δ} for some 1 ≤ j ≤ k and some xregex path δ, we define
    $$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma_{i}\right)}}\right):= \left( n^{\mathtt{a}}_{i}=x^{\mathtt{a}}_{j}\right) \wedge \left( n^{\mathtt{b}}_{i}=x^{\mathtt{b}}_{j}\right) \wedge \varphi_{\delta}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{\textsf{VarsBR}(\delta)}\right). $$
While the definition recurses in the case of xregex paths that contain variable bindings (the third case in the definition of \(\varphi _{\gamma _{i}}\) above), the formula φ is still ensured to be finite and well-defined (as δ is always a subexpression of γ and, hence, shorter).
Recall that by our initial assumption, for every variable xi, each variable reference &xi refers to the same word wi. Taking this into account, we can prove that
$$\Psi(\mathcal{L}(\alpha))=\{(n^{\mathtt{a}},n^{\mathtt{b}})\mid \varphi(n^{\mathtt{a}},n^{\mathtt{b}}) \text{ is true}\} $$
via a straightforward structural induction. □

We use Theorem 3.23 to separate the classes of languages that are recognized by core spanners and by vstar-free xregexes:

Lemma 3.24

Let \(L_{\text {nsl}} := \{(\mathtt {a}\mathtt {b}^{m})^{n}\mid m,n\geq 2\}\) and \(\rho := \zeta ^{R_{\text {com}}}_{x,y} (x\{\mathtt {a}\mathtt {b}\mathtt {b}^{+}\}y\{\Sigma ^{+}\})\) for Σ := {a, b}. Then \(L_{\text {nsl}}=\mathcal {L}(\rho )\), but there is no α ∈ vsfXR with \(\mathcal {L}(\alpha )=L_{\text {nsl}}\).

Proof

Assume that there is an α ∈ vsfXR with \(\mathcal {L}(\alpha )=L_{\text {nsl}}\). By Theorem 3.23, Lnsl must be semi-linear. Note that Ψ(Lnsl) = {(n, mn) ∣ m, n ≥ 2}. As semi-linear sets are closed under projection (cf. Ginsburg and Spanier [19]), this implies that the set C := {mn ∣ m, n ≥ 2} is semi-linear, and due to closure under complementation (also cf. [19]), the set P = {p ∣ p is prime, p = 0, or p = 1} is semi-linear as well. However, semi-linear sets are finite unions of linear sets, and as P is infinite, P contains a subset \(P_{c,a} := \{ c+an \mid n \in \mathbb {N}_{>0} \}\) of prime numbers for some c ≥ 2 and a ≥ 2. Obviously, \(c + ac = c(1 + a) \in P_{c,a}\), but c(1 + a) is a composite number. Hence, there is no α ∈ vsfXR with \(\mathcal {L}(\alpha ) = L_{\text {nsl}}\). □
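The equality \(L_{\text {nsl}}=\mathcal {L}(\rho )\) from Lemma 3.24 can be checked by brute force on short words. The following Python sketch does this under the assumption that the selection for R_com keeps exactly those pairs of words (x, y) that commute, i. e., xy = yx; the function names and the length bound are hypothetical choices for this illustration, and the check is a sanity test rather than a proof.

```python
import re
from itertools import product

def in_l_rho(w):
    """w = x y with x in abb+, y nonempty, and x, y commuting (assumed R_com)."""
    for cut in range(1, len(w)):
        x, y = w[:cut], w[cut:]
        if re.fullmatch(r"abb+", x) and x + y == y + x:
            return True
    return False

def l_nsl(max_len):
    """All words (a b^m)^n with m, n >= 2 of length at most max_len."""
    words = set()
    for m in range(2, max_len):
        block = "a" + "b" * m
        for n in range(2, max_len // len(block) + 1):
            words.add(block * n)
    return words

def all_words(alphabet, max_len):
    """All nonempty words over `alphabet` of length at most max_len."""
    return ["".join(t) for n in range(1, max_len + 1)
            for t in product(alphabet, repeat=n)]
```

Comparing `{w for w in all_words("ab", 12) if in_l_rho(w)}` with `l_nsl(12)` then confirms the claim for all words of length at most 12.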

We do not need the join operator to define non-semi-linear languages: Consider the core spanner representation ρ from Example 3.14, with the constraint for x changed to \(\mathcal {L}(\mathtt {abb^{+}})\), so that \(\mathcal {L}(\rho )=L_{\text {nsl}}\). If we construct \(\hat {\rho }\) as explained below that example, we obtain \(\mathcal {L}(\hat {\rho })=\{ww\mid w\in L_{\text {nsl}}\}\), which is also not semi-linear.

It is worth pointing out that Lemma 3.24 does not resolve the open question from [7] whether there is a language that is recognized by a core spanner, but not by an xregex, as Theorem 3.23 only applies to vstar-free xregexes. We have already seen languages that are not semi-linear, but are recognized by xregexes: The language Lnsl is recognized by αnsl := x{abb+} ⋅ (&x)+; and a similar approach is used for the following language (which we already met in Example 2.4):

Example 3.25

Let Σ := {a}, and define the language \(L_{\text {npr}} := \{\mathtt {a}^{mn}\mid m,n\geq 2\}\). In other words, Lnpr is the language of all words \(\mathtt {a}^{i}\) with i ≥ 4 such that i is not a prime number. Let αnpr := x{aa+} ⋅ (&x)+. Then \(\mathcal {L}(\alpha _{\text {npr}})=L_{\text {npr}}\).
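The xregex of Example 3.25 has a close analogue in practical regular expression dialects with backreferences. The following Python sketch uses the standard `re` module, where the capture group plays the role of the binding x{aa+} and `\1` plays the role of &x. The semantics of `re` is not identical to the xregex semantics in general, so this is only meant as an illustration for this particular expression; the helper name `in_l_npr` is hypothetical.

```python
import re

# Group 1 corresponds to x{aa+}; the backreference \1 corresponds to &x.
NON_PRIME = re.compile(r"(aa+)\1+")

def in_l_npr(i):
    """True if a^i matches, i.e. i = m*n for some m, n >= 2."""
    return NON_PRIME.fullmatch("a" * i) is not None
```

Calling `in_l_npr(i)` for small i accepts exactly the composite numbers, for example 4, 6, 8, and 9, and rejects 2, 3, 5, and 7.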

While Lnsl and Lnpr are defined by very similar xregexes, the latter cannot be recognized by core spanners. In order to show this with a semi-linearity argument, we observe:

Theorem 3.26

Let |Σ| = 1 and let P be a core spanner over Σ. Then \(\mathcal {L}(P)\) is semi-linear.

Proof

We prove this by showing that on unary terminal alphabets, every ECreg-language is semi-linear. Due to Theorem 3.12, this proves the claim.

Let Σ = {a}, and consider any ECreg-formula φ(w) over Σ. We show that \(\mathcal {L}(\varphi )\) is semi-linear by converting φ into a Presburger formula \(\hat {\varphi }\) for the set \(\Psi (\mathcal {L}(\varphi ))=\{|w|\mid w\in \mathcal {L}(\varphi )\}\). We obtain \(\hat {\varphi }\) by rewriting φ in the following way:
  1. Each quantifier ∃x is replaced with \(\exists \hat {x}\).
  2. Each regular constraint \(L_{A}(x)\) is replaced with a formula \(\hat {\varphi }_{A}(\hat {x})\) for the set \(\{|x|\mid x \in \mathcal {L}(A)\}\). As each \(\mathcal {L}(A)\) is a regular language, this is possible according to Ginsburg and Spanier [19].
  3. Each word equation ηL = ηR is replaced with the equation sum(ηL) = sum(ηR), where the function sum is defined by sum(a) := 1, \(\text {sum}(x):= \hat {x}\) for x ∈ X, and sum(α ⋅ β) := sum(α) + sum(β).
For example, the word equation xaxyx = ayzzya is converted into the Presburger equation \(\hat {x} + 1 + \hat {x} + \hat {y}+ \hat {x} = 1 + \hat {y}+ \hat {z}+ \hat {z}+ \hat {y} + 1\) (for Σ = {a}). Intuitively, each variable \(\hat {x}\) in \(\hat {\varphi }\) contains the length of x in φ (which, as |Σ| = 1, corresponds to the Parikh image of that word). Hence, the Presburger formula \(\hat {\varphi }\) defines the set \(\Psi (\mathcal {L}(\varphi ))\). According to [19], this implies that \(\Psi (\mathcal {L}(\varphi ))\) is semi-linear, which means that \(\mathcal {L}(\varphi )\) is semi-linear. This concludes the proof. □
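The function sum from the third rewriting step is easy to compute. The following Python sketch represents the resulting linear expression as a pair of variable coefficients and a constant; the function name `length_form` and this representation are assumptions made for the illustration.

```python
from collections import Counter

def length_form(side, terminals):
    """Return (coefficients, constant) of sum(side) for one side of an equation."""
    coefficients = Counter()
    constant = 0
    for symbol in side:
        if symbol in terminals:
            constant += 1              # sum(a) := 1
        else:
            coefficients[symbol] += 1  # sum(x) := x-hat
    return dict(coefficients), constant
```

For the equation xaxyx = ayzzya from above, the left side yields 3·x̂ + ŷ + 1 and the right side yields 2·ŷ + 2·ẑ + 2, which agrees with the Presburger equation stated in the proof.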

Note that this construction only applies to unary alphabets, as this is the only case where there is a one-to-one correspondence between words and their Parikh images.

Apart from the observation that Lnpr from Example 3.25 is not recognized by core spanners, Theorem 3.26 also allows us to conclude the following.

Corollary 3.27

If |Σ| = 1, then \(\mathcal {L}(P)\) is regular for every core spanner P.

In other words, for unary terminal alphabets, core spanners recognize exactly the same class as regular spanners, namely the class of regular languages (which, in the unary case, is identical to the class of context-free languages). Furthermore, Lemma 3.24 and Theorem 3.26 together show the following.

Corollary 3.28

The class of languages that is recognized by core spanners is not closed under homomorphisms.

We conclude this section with a summary of our insights into the relative expressive power of the various models. To increase readability, we use the following definitions: Let REG, XR, and PAT denote the classes of regular expressions, xregexes, and patterns, respectively. For a class of language recognizing mechanisms \(\mathcal {D}\), let \(\mathcal {L}(\mathcal {D})\) denote the class of languages that are recognized by elements of \(\mathcal {D}\). For example, \(\mathcal {L}(\textsf{PAT})\) is the class of pattern languages, and \(\mathcal {L}(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}})\) is the class of languages that are recognized by core spanners. The hierarchy in Fig. 2 is obtained by combining the results in the present section with the fact that every pattern language contains either exactly one or infinitely many words (first observed by Angluin [1]), and that there are regular languages that are not EC-recognizable (see Karhumäki et al. [26]). Two sets of questions remain open: Firstly, although Theorem 3.26 together with Example 3.25 shows that there is a language that is recognized by xregex, but not by ECreg (and, hence, also not by EC or \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\)), it remains open whether the reverse direction holds as well. Secondly, although we know that \(\mathcal {L}(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}})\subseteq \mathcal {L}({\textsf{EC}^{\textsf{reg}}})\), we do not know whether this inclusion is strict. In fact, it even remains open whether there is a language that is recognized by EC, but not by \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\). This second set of questions is discussed in more detail in Freydenberger [13].
Fig. 2

To the left: The relationship of the various language classes. An arrow denotes proper inclusion (of the source class in the target class), the dotted arrow denotes inclusion. To the right: The references for these results. See also the explanation at the end of Section 3

4 Decision Problems

4.1 Spanner Evaluation

We first examine the combined complexity of the evaluation problem for core spanners. To this end, we define the problem CSp−Eval: Given a core spanner representation \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), a word \(w\in \Sigma ^{*}\), and a (SVars(ρ), w)-tuple μ, is μ ∈ [[ρ]](w)? In order to prove lower bounds for this problem, we consider the membership problem for pattern languages: Given a pattern α and a word w, decide whether \(w\in \mathcal {L}(\alpha )\). As shown by Jiang et al. [24], this problem is NP-complete (for pattern languages that do not allow replacing variables with ε, this was already shown by Angluin [1]). Due to Theorem 3.3, we observe the following (the proof of NP-membership is straightforward).

Theorem 4.1

CSp−Eval is NP-complete, even if restricted to \(\textsf{RGX}^{\{\pi ,\zeta ^=\}}\).

Proof

In order to prove NP-hardness, it suffices to give a polynomial time reduction from the membership problem for pattern languages to CSp−Eval. Given a pattern α and a word w, we use Theorem 3.3 to construct a spanner representation \(\rho _{\alpha }\in \textsf{RGX}^{\{\zeta ^=\}}\) in polynomial time such that \(\mathcal {L}(\alpha )=\mathcal {L}(\rho _{\alpha })\). Next, we define \(\rho := \pi _{\emptyset }\rho _{\alpha }\). As ρ represents a Boolean spanner, we define μ to be the empty tuple (). Now, μ ∈ [[ρ]](w) holds if and only if \(w\in \mathcal {L}(\alpha )\).

We prove membership in NP using the following NP-algorithm: Assume that we are given a core spanner representation ρ, a word \(w\in \Sigma ^{*}\), and a w-tuple μ. For every regex formula γ in ρ, we nondeterministically guess a w-tuple \(\mu _{\gamma }\). By definition, each of these tuples has a size that is polynomial in |w|. In addition to this, for every union (ρ1 ∪ ρ2), we guess a representation ρi that is ignored. We then verify these guesses deterministically: First, we discard all parts of ρ that are ignored, and obtain a spanner representation \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\bowtie \}}\). For all remaining regex formulas γ in \(\hat {\rho }\), we check whether \(\mu _{\gamma }\) is consistent with γ and w. Obviously, this can be done in polynomial time. If all of these checks pass, we evaluate all operators in \(\hat {\rho }\). As \(\hat {\rho }\) contains no unions, the result of these evaluations is always either ∅, or a set that contains exactly one w-tuple. Hence, this process only takes polynomial time. Furthermore, when it terminates, it results either in ∅, or in a w-tuple \(\hat {\mu }\). In the latter case, we return True if and only if \(\hat {\mu }=\mu \). □
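To illustrate the source of hardness, the membership problem for pattern languages can be decided by trying all replacements of each variable. The following Python sketch does this by backtracking; it is a deterministic simulation of the guessing step and hence takes exponential time in the worst case. The function name `matches`, the encoding of a pattern as a string of single-character symbols, and the explicit set of variables are assumptions made for this illustration. Variables may be replaced with the empty word.

```python
def matches(pattern, word, variables, assignment=None):
    """Decide whether word is in L(pattern) by backtracking over
    all replacements of the variables (empty replacements allowed)."""
    if assignment is None:
        assignment = {}
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if head not in variables:
        # Terminal symbol: it has to match the next letter of the word.
        return word[:1] == head and matches(rest, word[1:], variables, assignment)
    if head in assignment:
        # Repeated variable: it has to be replaced with the same word again.
        value = assignment[head]
        return word.startswith(value) and matches(
            rest, word[len(value):], variables, assignment)
    # First occurrence of the variable: try every prefix as its replacement.
    for end in range(len(word) + 1):
        extended = {**assignment, head: word[:end]}
        if matches(rest, word[end:], variables, extended):
            return True
    return False
```

For instance, `matches("xax", "bab", {"x"})` succeeds with x replaced by b, while `matches("xx", "aba", {"x"})` fails, as no word of odd length is a square.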

The question arises whether there are natural restrictions to CSp−Eval that make this problem tractable. It appears that any subclass of the core spanners that extends regular spanners in a meaningful way while having a tractable evaluation problem cannot be allowed to recognize the full class of pattern languages.

For pattern languages, it was shown by Ibarra et al. [23] that bounding the number of variables in the pattern leads to an algorithm for the membership problem with a running time that is polynomial, although in \(\mathcal {O}(n^{k})\) (where n is the length of the word w, and k the number of variables). From a parameterized complexity point of view (see e. g. Grohe and Flum [20]), this is usually not considered satisfactory. Without going too much into details, in parameterized complexity, one generally considers parameterized problems tractable that belong to the class FPT (from fixed-parameter tractable). This class is defined as follows: The input of a parameterized problem is a pair (x, k), where x is the input of the non-parameterized problem (e. g., a pattern α and a word w), and k is a parameter of the input (e. g., the number of variables in α). The parameterized problem is in FPT if there exist a computable function f, a constant c ≥ 0, and an algorithm that decides the problem in time \(\mathcal {O}(f(k)\cdot n^{c})\). We do not define the class W[1], but we note that the standard complexity theoretic assumption is that if a problem is W[1]-hard, it is not in FPT.

It was first observed by Stephan et al. [34] that the membership problem for pattern languages is W[1]-complete if the number of variable occurrences (not of variables) is used as a parameter (see Fernau et al. [11] for the full proof). As the number of variable occurrences in a pattern corresponds to the number of variables in an equivalent spanner, this implies that using the number of variables in a spanner as parameter leads to W[1]-hardness for this parameter of CSp−Eval.

Fernau and Schmid [10] and Fernau et al. [11] discuss these and various other potential restrictions to pattern languages that still do not lead to tractability (among these a bound on the length of the replacement of each variable, which corresponds to a bound on the length of spans). On the other hand, Reidenbach and Schmid [33] and Fernau et al. [9] examine parameters for patterns that make the membership problem tractable. While this does not directly translate to spanners, the authors consider these directions promising for further research.

But apart from these potential restrictions on the use of string equality, other restrictions are needed, as the use of join also makes evaluation intractable:

Proposition 4.2

CSp−Eval is NP-complete, even if restricted to \(\textsf{RGX}^{\{\pi ,\bowtie \}}\).

Proof

We prove this with a reduction from the Clique problem: Given an undirected graph G = (V, E) and a number k ≤ |V|, decide whether G contains a clique of size k. This problem is NP-complete (cf. Garey and Johnson [18]). Consider an undirected graph G = (V, E) with V = {1, …, n} for some n ≥ 1, and a number k ≤ n. Let \(\mathtt{a}\in \Sigma \) and define \(w := \mathtt{a}^{n}\) and \(\rho := \bowtie _{1\leq i<j\leq k}\alpha _{i,j}\), where each \(\alpha _{i,j}\) is defined by
$$\alpha_{i,j}:= \underset{\underset{u<v}{\{u,v\}\in E,}}{\bigvee}\:\mathtt{a}^{u-1}\: x_{i}\{\mathtt{a}\}\: \mathtt{a}^{v-u-1}\: x_{j}\{\mathtt{a}\}\: \mathtt{a}^{n-v}. $$
In other words, each part of the disjunction corresponds to a choice of u and v, which allows \([{\kern -2.3pt}[ \alpha _{i,j} ]{\kern -2.3pt}](w)\) to map \(x_{i}\) to the u-th and \(x_{j}\) to the v-th letter of w. Then μ ∈ [[ρ]](w) holds if and only if there exist distinct nodes \(v_{1}, \ldots , v_{k}\in V\) such that \(\{v_{i}, v_{j}\}\in E\) for all 1 ≤ i < j ≤ k; and \(\mu (x_{i}) = [v_{i}, v_{i} + 1\rangle \) for 1 ≤ i ≤ k. Thus, the empty tuple is an element of \([{\kern -2.3pt}[ \pi _{\emptyset }\rho ]{\kern -2.3pt}](w)\) if and only if G contains a clique of size k. □
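The reduction can be traced on small graphs with the following Python sketch. It is a naive illustration under several assumptions: a w-tuple is modeled as a dictionary from variable indices to spans (pairs of positions), the names `alpha`, `join`, and `has_clique` are chosen here, and the join is evaluated by brute force, so intermediate results can become exponentially large. We assume k ≥ 2, so that the join ranges over at least one pair.

```python
from itertools import combinations

def alpha(i, j, n, edges):
    """All tuples of alpha_{i,j} on w = a^n: for every edge {u, v} with u < v,
    x_i is mapped to the span [u, u+1> and x_j to the span [v, v+1>."""
    tuples = []
    for edge in edges:
        u, v = sorted(edge)
        if 1 <= u < v <= n:
            tuples.append({i: (u, u + 1), j: (v, v + 1)})
    return tuples

def join(left, right):
    """Natural join: combine tuples that agree on all shared variables."""
    joined = []
    for t1 in left:
        for t2 in right:
            shared = t1.keys() & t2.keys()
            if all(t1[x] == t2[x] for x in shared):
                joined.append({**t1, **t2})
    return joined

def has_clique(n, edges, k):
    """True iff the join of all alpha_{i,j} with 1 <= i < j <= k is
    nonempty on a^n, i.e., iff the Boolean projection contains ()."""
    result = [{}]
    for i, j in combinations(range(1, k + 1), 2):
        result = join(result, alpha(i, j, n, edges))
    return len(result) > 0
```

For the graph with edges {1, 2}, {1, 3}, {2, 3}, and {3, 4}, the call `has_clique(4, [(1, 2), (1, 3), (2, 3), (3, 4)], 3)` finds the triangle on the nodes 1, 2, 3, while the same call with k = 4 fails.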

We also consider the data complexity of the evaluation problem for core spanners. For every core spanner representation ρ over Σ, we define the decision problem CSp−Eval(ρ): Given a word \(w\in \Sigma ^{*}\) and a w-tuple μ, is μ ∈ [[ρ]](w)? Using a slight variation of the proof of Theorem 4.1, we obtain the following.

Theorem 4.3

CSp−Eval(ρ) is in NLOGSPACE for every \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\).

Proof

This result follows from a slight change to the NP-decision procedure from the proof of Theorem 4.1: We can represent the guessed w-tuples μγ for each regex formula γ by using two pointers for each μγ(x) = [i, j〉 (one pointer for i, one for j). As ρ is fixed, a finite number of such pointers suffices to represent all w-tuples. Furthermore, the verification of these guesses can also be realized nondeterministically with only a constant amount of additional pointers. □

4.2 Static Analysis

We consider the following common decision problems for core spanner representations, where the input is \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) or \(\rho _{1},\rho _{2}\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\):
  1. CSp−Sat: Is [[ρ]](w) ≠ ∅ for some \(w\in \Sigma ^{*}\)?
  2. CSp−Hierarchicality: Is [[ρ]] hierarchical?
  3. CSp−Universality: Is [[ρ]] = \(\Upsilon _{\textsf{SVars}(\rho )}\)?
  4. CSp−Equivalence: Is [[ρ1]] = [[ρ2]]?
  5. CSp−Containment: Is [[ρ1]] ⊆ [[ρ2]]?
  6. CSp−Regularity: Is [[ρ]] ∈ \([{\kern -2.3pt}[ \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}} ]{\kern -2.3pt}]\)?
     
We approach the first two of these problems by using Theorem 3.12 to convert core spanner representations to ECreg-formulas, for which satisfiability is in PSPACE (cf. Diekert [6]). Hence, we observe:

Theorem 4.4

The problem CSp−Sat is PSPACE-complete, even if it is restricted to spanner representations from \(\textsf{RGX}^{\{\zeta ^=\}}\).

Proof

We begin with the upper bound. According to Theorem 3.12, for every core spanner representation ρ, there exists an ECreg-formula φ that realizes [[ρ]]. Furthermore, φ can be computed in polynomial time. In particular, φ is satisfiable if and only if ρ is satisfiable. As satisfiability for ECreg-formulas is in PSPACE (cf. Diekert [6]), this question can be answered in PSPACE.

For the lower bound, we construct a reduction to CSp−Sat from the intersection emptiness problem for regular expressions, which is defined as follows: Given (proper) regular expressions α1, …, αn, decide whether \(\bigcap _{i=1}^{n}\mathcal {L}(\alpha _{i})=\emptyset \). As a direct consequence of the proof of Lemma 3.2.3 in Kozen [27], this problem is PSPACE-complete (although Kozen’s proof uses automata, these are defined via regular expressions). Recall that every proper regular expression is also a functional regex formula. Hence, we can construct a Boolean spanner representation
$$\rho:= \zeta^=_{x_{1},\ldots,x_{n}}x_{1}\{\alpha_{1}\}{\cdots} x_{n}\{\alpha_{n}\}. $$
Obviously, for every \(w\in \Sigma ^{*}\), we have [[ρ]](w) ≠ ∅ if and only if there exists a word \(v\in \Sigma ^{*}\) with \(w = v^{n}\) and \(v\in \mathcal {L}(\alpha _{i})\) for 1 ≤ i ≤ n. Hence, ρ is satisfiable if and only if \(\bigcap _{i=1}^{n}\mathcal {L}(\alpha _{i})\neq \emptyset \). As PSPACE is closed under complementation, this proves PSPACE-hardness of CSp−Sat, even when restricted to representations from the class \(\textsf{RGX}^{\{\zeta ^=\}}\). □
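The effect of the string equality selection in this reduction can be checked on concrete words with a small Python sketch. It is an illustration under assumptions: the proper regular expressions are written in the syntax of Python's `re` module, and the function name `satisfies` is chosen here. As the selection forces all n variables to be bound to the same word v, and the regex formula has to match all of w, the word w is split into n consecutive blocks of equal length.

```python
import re

def satisfies(word, expressions):
    """True iff word = v^n for some v that fully matches every
    expression, where n is the number of expressions."""
    n = len(expressions)
    if n == 0 or len(word) % n != 0:
        return False
    size = len(word) // n
    blocks = [word[k * size:(k + 1) * size] for k in range(n)]
    v = blocks[0]
    # String equality selection: all blocks have to be the same word v.
    if any(block != v for block in blocks):
        return False
    # Each variable x_i is bound to v, which has to belong to L(alpha_i).
    return all(re.fullmatch(e, v) is not None for e in expressions)
```

For example, with the expressions `(ab)*` and `a.*b`, the word abab is accepted (with v = ab), while abba is rejected because its two halves differ.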

The proof of the lower bound in Theorem 4.4 uses the PSPACE-hardness of the intersection emptiness problem for regular expressions. But even if the variables in the regex formulas were only bound to \(\Sigma ^{*}\), it follows from Theorem 3.13 that this problem would still be at least as hard as the satisfiability problem for word equations without constraints. Considering that even proving the decidability was hard (see Diekert [6] for an overview), approaching CSp−Sat without knowledge on word equations would have required enormous additional effort.

It is also possible to use ECreg-formulas to express a violation of the criteria for hierarchicality. This allows us to state the following result:

Theorem 4.5

The problem CSp−Hierarchicality is PSPACE-complete, even if it is restricted to \(\textsf{RGX}^{\{\zeta ^=,\times \}}\).

Proof

We begin with the upper bound. The main idea is that non-hierarchicality can be expressed in ECreg-formulas. Hence, our goal is to construct a polynomial time procedure that, given a core spanner representation \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), constructs an ECreg-formula \(\varphi _{\text{NH}}\) that is satisfiable if and only if [[ρ]] is not hierarchical.

Recall that, by definition, for every spanner P and every word \(w\in \Sigma ^{*}\), a w-tuple μ ∈ P(w) is not hierarchical if there exist variables x, y ∈ SVars(P) such that all of the following hold:
  1. The span μ(x) does not contain μ(y),
  2. the span μ(y) does not contain μ(x), and
  3. the spans μ(x) and μ(y) overlap (i. e., they are not disjoint).
     
If this is the case, we say that μ(x) and μ(y) strictly overlap. It is easy to see that two spans [i1, j1〉 and [i2, j2〉 strictly overlap if one of the following strict overlap conditions is met:
  1. \(i_{1} < i_{2} < j_{1} < j_{2}\),
  2. \(i_{2} < i_{1} < j_{2} < j_{1}\).
     
For an illustration of these two conditions, see Fig. 3. Our next goal is to define an ECreg-formula \(\varphi _{\text{ovl}}(x^{P},x^{C},y^{P},y^{C})\) that expresses the first condition when combined with an ECreg-formula that realizes a spanner (we do not need to define a formula for the second condition, as both conditions are symmetrical). To this purpose, we first define the ECreg-formula
$$\varphi_{\text{ppref}}(x,y):= \exists z: (L_{A}(z)\wedge (xz = y)),$$
where A is an NFA with \(\mathcal {L}(A)=\Sigma ^{+}\). Clearly, \((x, y)\in \Sigma ^{*}\times \Sigma ^{*}\) satisfies \(\varphi _{\text{ppref}}\) if and only if x is a proper prefix of y. Next, we define
$$\begin{array}{@{}rcl@{}} &&\varphi_{\text{ovl}}(x^{P},x^{C},y^{P},y^{C}):=\\ &&\qquad\qquad\qquad\qquad\exists z_{1}, z_{2}:\ ((z_{1} = x^{P}x^{C})\wedge (z_{2} = y^{P} y^{C})\\ &&\qquad\qquad\qquad\qquad\qquad\qquad\wedge \varphi_{\text{ppref}}(x^{P},y^{P}) \wedge \varphi_{\text{ppref}}(y^{P},z_{1}) \wedge \varphi_{\text{ppref}}(z_{1},z_{2})). \end{array} $$
The idea behind the construction is as follows: Recall that this formula is going to be used together with an ECreg-formula that realizes a spanner. Hence, \(x^{P}\) and \(x^{C}\) represent a span \([1 + |x^{P}|, 1 + |x^{P}x^{C}|\rangle = [i_{1}, j_{1}\rangle \), while \(y^{P}\) and \(y^{C}\) represent a span \([1 + |y^{P}|, 1 + |y^{P}y^{C}|\rangle = [i_{2}, j_{2}\rangle \). In particular, \(x^{P}x^{C}\) and \(y^{P}y^{C}\) are both prefixes of some common word w. Hence, \(i_{1} < i_{2}\) holds if and only if \(x^{P}\) is a proper prefix of \(y^{P}\). Likewise, \(i_{2} < j_{1}\) and \(j_{1} < j_{2}\) hold if and only if \(y^{P}\) is a proper prefix of \(x^{P}x^{C}\), or \(x^{P}x^{C}\) is a proper prefix of \(y^{P}y^{C}\), respectively.
Fig. 3

The two possibilities how two spans can strictly overlap (see proof of Theorem 4.5). To the left: i1 < i2 < j1 < j2. To the right: i2 < i1 < j2 < j1

In other words, \(\varphi _{\text{ovl}}\) checks whether the first of the two strict overlap conditions is satisfied.
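The correspondence between the prefix conditions and the first strict overlap condition can be checked directly on concrete words. The following Python sketch is an illustration only; the function names and the encoding of a span [i, j〉 (with positions starting at 1) as a pair `(i, j)` are assumptions made here. The helper `encode` computes the prefix part \(w_{[1,i\rangle }\) and the content part \(w_{[i,j\rangle }\) of a span.

```python
def is_proper_prefix(x, y):
    """Counterpart of phi_ppref: y = x z for some nonempty word z."""
    return len(x) < len(y) and y.startswith(x)

def phi_ovl(x_p, x_c, y_p, y_c):
    """Counterpart of phi_ovl: expresses i1 < i2 < j1 < j2 via prefixes."""
    z1, z2 = x_p + x_c, y_p + y_c
    return (is_proper_prefix(x_p, y_p)      # i1 < i2
            and is_proper_prefix(y_p, z1)   # i2 < j1
            and is_proper_prefix(z1, z2))   # j1 < j2

def encode(word, span):
    """Prefix part and content part of the span [i, j> of word."""
    i, j = span
    return word[:i - 1], word[i - 1:j - 1]

def strictly_overlap(word, span1, span2):
    """True iff one of the two strict overlap conditions holds."""
    a, b = encode(word, span1), encode(word, span2)
    return phi_ovl(*a, *b) or phi_ovl(*b, *a)
```

On the word abcde, the spans [1, 4〉 and [2, 5〉 strictly overlap, whereas the nested spans [1, 5〉 and [2, 4〉 and the disjoint spans [1, 3〉 and [3, 5〉 do not.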

We are now ready to construct \(\varphi _{\text{NH}}\). Let \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), and assume that \(\textsf{SVars}(\rho ) = \{x_{1}, \ldots , x_{n}\}\) for some n ≥ 2 (spanners with fewer than two variables are trivially hierarchical). Using Theorem 3.12, we then construct an ECreg-formula \(\varphi _{\rho }(x_{w}, {x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}})\) that realizes [[ρ]]. We now define
$$\begin{array}{@{}rcl@{}} &&\varphi_{\text{NH}} := \exists x_{w}, {x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}: \\ &&\qquad\qquad\qquad\quad\left( \!\varphi_{\rho}\!\left( \!x_{w}, {x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) \!\wedge\! \underset{\underset{i\neq j}{1\leq i,j\leq n;}}{\bigvee}\varphi_{\text{ovl}}\!\left( \!{x^{P}_{i}},{x^{C}_{i}},{x^{P}_{j}},{x^{C}_{j}}\!\right)\!\right)\!. \end{array} $$
Assume that [[ρ]] is not hierarchical. Then there exist a word \(w\in \Sigma ^{*}\), a w-tuple μ ∈ [[ρ]](w), and \(x_{l}, x_{m}\in \textsf{SVars}(\rho )\) such that \(\mu (x_{l})\) and \(\mu (x_{m})\) strictly overlap. As \(\varphi _{\rho }\) realizes [[ρ]], we have that μ defines an assignment \((w, w_{[1,i_{1}\rangle }, w_{[i_{1},j_{1}\rangle }, \ldots , w_{[1,i_{n}\rangle }, w_{[i_{n},j_{n}\rangle })\) that satisfies this subformula (where \([i_{k}, j_{k}\rangle = \mu (x_{k})\)). Furthermore, as \(\mu (x_{m})\) and \(\mu (x_{l})\) strictly overlap, either \(\varphi _{\text {ovl}}({x^{P}_{l}},{x^{C}_{l}},{x^{P}_{m}},{x^{C}_{m}})\) or \(\varphi _{\text {ovl}}({x^{P}_{m}},{x^{C}_{m}},{x^{P}_{l}},{x^{C}_{l}})\) is satisfied (if \(i_{l} < i_{m}\) or \(i_{m} < i_{l}\), respectively). Hence, \(\varphi _{\text{NH}}\) is satisfiable.

Likewise, \(\varphi _{\text{NH}}\) is only satisfied if \(\varphi _{\rho }\) and (at least) one \(\varphi _{\text {ovl}}({x^{P}_{l}},{x^{C}_{l}},{x^{P}_{m}},{x^{C}_{m}})\) are satisfied. This corresponds to a w-tuple μ where \(\mu (x_{l})\) and \(\mu (x_{m})\) strictly overlap. Hence, μ is not hierarchical, which means that [[ρ]] is not hierarchical.

Therefore, \(\varphi _{\text{NH}}\) is satisfiable if and only if [[ρ]] is not hierarchical. Furthermore, \(\varphi _{\text{NH}}\) can be constructed in polynomial time, as we only need to construct \(\varphi _{\rho }\) (which is possible in polynomial time, according to the proof of Theorem 4.4), and a number of \(\varphi _{\text{ovl}}\)-formulas that is quadratic in |SVars(ρ)|, each of which has a constant length. Both constructions rely solely on the syntax of ρ, and require no further computation.

As satisfiability of ECreg-formulas can be decided in PSPACE, the complement of CSp−Hierarchicality is in PSPACE; and as PSPACE is closed under complementation, this means that CSp−Hierarchicality is in PSPACE.

For the lower bound, we slightly modify the proof of the lower bound for CSp−Sat. Again, we use the intersection emptiness problem for regular expressions. Given proper regular expressions α1, …, αn, we define
$$\rho:= \zeta^=_{x_{1},\ldots,x_{n}}(x_{1}\{\mathtt{a}\mathtt{a}\mathtt{a}\cdot\alpha_{1}\}{\cdots} x_{n}\{\mathtt{a}\mathtt{a}\mathtt{a}\cdot\alpha_{n}\}) \times (y\{\Sigma\cdot\Sigma^{+}\}\cdot\Sigma)\times (\Sigma\cdot z\{\Sigma^{+}\cdot\Sigma\}), $$
for some \(\mathtt{a}\in \Sigma \). By replacing each \(\alpha _{i}\) in that proof with \(\mathtt{a}\mathtt{a}\mathtt{a}\cdot \alpha _{i}\), we ensure that every word \(w\in \Sigma ^{*}\) with \([{\kern -2.3pt}[ \zeta ^=_{x_{1},\ldots ,x_{n}}(x_{1}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{1}\}\cdots x_{n}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{n}\}) ]{\kern -2.3pt}](w)\neq \emptyset \) has length at least 3 (which is the minimal word length for which non-hierarchical spanners are possible). Furthermore, for each such w, the variable y is assigned the span that contains all positions of w except the last one, and z is assigned the span that contains all positions except the first one. Hence, these spans strictly overlap, which means that ρ is not hierarchical. On the other hand, if \([{\kern -2.3pt}[ \zeta ^=_{x_{1},\ldots ,x_{n}}(x_{1}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{1}\}\cdots x_{n}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{n}\}) ]{\kern -2.3pt}](w)=\emptyset \), then [[ρ]](w) = ∅. Therefore, ρ is hierarchical if and only if there is no \(w\in \bigcap _{1\leq i\leq n}\mathcal {L}(\alpha _{i})\). As this problem is PSPACE-complete, CSp−Hierarchicality is PSPACE-hard. □

For the remaining problems, we use Theorem 3.21, and the fact that the undecidability results from Freydenberger [12] also hold for vstar-free xregexes:

Theorem 4.6

The problems CSp−Universality and CSp−Equivalence are not semi-decidable, but co-semi-decidable. The problem CSp−Regularity is neither semi-decidable, nor co-semi-decidable. These results hold even if the input is restricted to \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}\).

Proof

The co-semi-decidability of the first two problems is obvious. We discuss this for universality: For any core spanner representation ρ and any word \(w\in \Sigma ^{*}\), we can always decide whether [[ρ]](w) = \(\Upsilon _{\textsf{SVars}(\rho )}(w)\) holds. Hence, we can semi-decide non-universality by enumerating all \(w\in \Sigma ^{*}\) until we find a word w with [[ρ]](w) ≠ \(\Upsilon _{\textsf{SVars}(\rho )}(w)\). Thus, CSp−Universality is co-semi-decidable. The proof for CSp−Equivalence works analogously.
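The enumeration argument can be sketched in Python as follows. This is an illustration under assumptions: spanners are modeled as Python functions that map a word to a comparable result (e. g., a set of tuples or a Boolean), and the names `words` and `first_difference` as well as the optional parameter `limit` are chosen here. Without `limit`, the search terminates only if the two functions differ on some word, which is exactly the behavior of a semi-decision procedure for non-equivalence.

```python
from itertools import count, product

def words(alphabet):
    """Enumerate all words over the alphabet, ordered by length."""
    for length in count(0):
        for letters in product(sorted(alphabet), repeat=length):
            yield "".join(letters)

def first_difference(p, q, alphabet, limit=None):
    """Return the first enumerated word on which p and q differ.
    If limit is given, give up after checking that many words."""
    for index, w in enumerate(words(alphabet)):
        if limit is not None and index >= limit:
            return None
        if p(w) != q(w):
            return w
```

For example, comparing the Boolean function that rejects exactly the words containing ab with the function that accepts every word yields the witness ab after five enumerated words.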

We now proceed to the proofs of the lower bounds. As shown by Freydenberger [12], if |Σ| ≥ 2, for xregexes α, the following holds:
  • It is not semi-decidable whether \(\mathcal {L}(\alpha )=\Sigma ^{*}\),

  • It is neither semi-decidable, nor co-semi-decidable whether \(\mathcal {L}(\alpha )\) is a regular language.

The proof in [12] takes a Turing machine \(\mathcal {X}\) (with some additional technical restrictions) and computes an xregex \(\alpha _{\mathcal {X}}\) with a single variable x such that \(\mathcal {L}(\alpha _{\mathcal {X}})=\Sigma ^{*}\) if and only if \(\mathcal {X}\) accepts no input, and \(\mathcal {L}(\alpha _{\mathcal {X}})\) is regular if and only if \(\mathcal {X}\) accepts only finitely many inputs.
These xregexes \(\alpha _{\mathcal {X}}\) are defined over the alphabet Σ = {0, #} and, when adapted to the notation of this paper, are always of the following shape:
$$\alpha_{\mathcal{X}}=\alpha_{struc}\mathbin{\vee}\alpha_{state}\mathbin{\vee}\alpha_{head}\mathbin{\vee}\alpha_{mod}\mathbin{\vee}\alpha_{var}. $$
It is important to note that all subexpressions except αvar are proper regular expressions, while
$$\alpha_{var}=(0\mathbin{\vee} \#)^{*}\#0\cdot x\{0^{*}\} \cdot(\alpha_{1}\mathbin{\vee} \alpha_{2} \mathbin{\vee} {\cdots} \mathbin{\vee} \alpha_{n}) $$
for some \(n\in \mathbb {N}_{>0}\) that depends on \(\mathcal {X}\), where all \(\alpha _{i}\) are xregex paths that do not contain variable bindings, and no other variable references than &x.

We note that the single variable binding \(x\{0^{*}\}\) and all variable references &x do not occur under a Kleene star, and conclude that \(\alpha _{\mathcal {X}}\) is a vstar-free xregex.

By Theorem 3.21, we can effectively convert every \(\alpha _{\mathcal {X}}\) into a Boolean spanner representation \(\rho _{\mathcal {X}}\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}\) with \(\mathcal {L}(\rho _{\mathcal {X}})=\mathcal {L}(\alpha _{\mathcal {X}})\).

Then \([{\kern -2.3pt}[ \rho _{\mathcal {X}} ]{\kern -2.3pt}]={\Upsilon }_{\emptyset }\) holds if and only if \(\mathcal {L}(\alpha _{\mathcal {X}})=\Sigma ^{*}\). As this question is not semi-decidable, CSp−Universality is also not semi-decidable. As CSp−Universality is a special case of CSp−Equivalence, the latter problem is also not semi-decidable.

Furthermore, \([{\kern -2.3pt}[ \rho _{\mathcal {X}} ]{\kern -2.3pt}]\) is a regular spanner if and only if \(\mathcal {L}(\alpha _{\mathcal {X}})\) is a regular language (as shown by Fagin et al. [7], when viewed as language definition mechanisms, regular spanners define exactly the class of regular languages). This question is neither semi-decidable, nor co-semi-decidable; hence, this applies to CSp−Regularity as well. □

As the proof of Theorem 4.6 relies only on Boolean spanners, the decidability status of CSp−Regularity does not change if the problem asks for hierarchical regularity (i. e., membership in [[RGX]]) instead of regularity, as the two classes coincide for Boolean spanners. Likewise, CSp−Universality remains not semi-decidable if one replaces \(\Upsilon _{\textsf{SVars}(\rho )}\) with \({\Upsilon }^H_{\textsf{SVars}(\rho )}\).

In the construction from this proof, variables are only bound to a language a+. Hence, the same undecidability results hold for spanners that use selections by equal length relation, instead of the string equality relation. While the proof builds on xregexes \(\alpha _{\mathcal {X}}\) that use only a single variable x, the resulting core spanners use an unbounded number of variables, as every occurrence of a variable reference &x in an xregex path is converted to a spanner variable \(x_{i}\). But undecidability remains even if we bound the number of variables in the spanners, as the \(\alpha _{\mathcal {X}}\) can be modified to use only a bounded number of variable references (see Section 4.1 in [12]). Theorem 4.6 also implies that CSp−Containment is not semi-decidable. This holds even for a more restricted class of spanners:

Theorem 4.7

The problem CSp−Containment is not semi-decidable, even if it is restricted to \(\textsf{RGX}^{\{\pi ,\zeta ^=\}}\).

Proof

This proof uses the undecidability of the inclusion problem for pattern languages, which is defined as follows: Given two patterns α and β, decide whether \(\mathcal {L}(\alpha )\subseteq \mathcal {L}(\beta )\).

For unbounded sizes of Σ, this undecidability was proven by Jiang et al. [25], and Freydenberger and Reidenbach [15] adapted this proof to all (non-unary) finite terminal alphabets.

Given two patterns α, β, we can use Theorem 3.3 to construct spanner representations \(\rho _{\alpha },\rho _{\beta }\in \textsf{RGX}^{\{\zeta ^=\}}\) with \(\mathcal {L}(\rho _{X})=\mathcal {L}(X)\) for X ∈ {α, β}, and turn these into representations of Boolean spanners \(\hat {\rho }_{X}:=\pi _{\emptyset }\rho _{X}\). Then \([{\kern -2.3pt}[ \hat {\rho }_{\alpha } ]{\kern -2.3pt}](w)\subseteq [{\kern -2.3pt}[ \hat {\rho }_{\beta } ]{\kern -2.3pt}](w)\) holds for all \(w\in \Sigma ^{*}\) if and only if \(\mathcal {L}(\alpha )\subseteq \mathcal {L}(\beta )\).

This shows that CSp−Containment is not decidable. As it is obviously co-semi-decidable, this also shows that CSp−Containment is not semi-decidable. □

As shown by Bremer and Freydenberger [4], the inclusion problem for pattern languages remains undecidable if the number of variables in the patterns is bounded. In fact, that proof constructs patterns where even the number of variable occurrences is bounded. Therefore, CSp−Containment is not semi-decidable even if restricted to representations from \(\textsf{RGX}^{\{\pi ,\zeta ^=\}}\) with a bounded number of variables. It is a hard open question whether the equivalence problem for pattern languages is decidable (cf. Ohlebusch and Ukkonen [31], Freydenberger and Reidenbach [15]). Undecidability of this problem would imply undecidability of CSp−Equivalence, even if restricted to representations from \(\textsf{RGX}^{\{\pi ,\zeta ^=\}}\).

We conclude this part of the section with a table that summarizes our results on decision problems:

| Problem | Status | Reference |
|---|---|---|
| CSp−Eval | NP-complete | Theorem 4.1, Proposition 4.2 |
| CSp−Eval(ρ) | in NLOGSPACE | Theorem 4.3 |
| CSp−Sat | PSPACE-complete | Theorem 4.4 |
| CSp−Hierarchicality | PSPACE-complete | Theorem 4.5 |
| CSp−Universality | co-semi-decidable, not semi-decidable | Theorem 4.6 |
| CSp−Equivalence | co-semi-decidable, not semi-decidable | Theorem 4.6 |
| CSp−Containment | co-semi-decidable, not semi-decidable | Theorem 4.7 |
| CSp−Regularity | neither semi-, nor co-semi-decidable | Theorem 4.6 |

Details under which restrictions the lower bounds persist can be found in the respective results.

4.2.1 Minimization and Relative Succinctness

In order to address the minimization of spanner representations, we first formalize the notion of the size or complexity of a spanner representation. Even for proper regular expressions, there are various different definitions of size, see e. g. Holzer and Kutrib [22], and there might be convincing reasons to add additional weight to the number of variables or other parameters. As we shall see, these distinctions do not affect the negative results that we prove later. Hence, instead of defining a single fixed notion of size, we use the following general definition of complexity measures from Kutrib [29]:

Definition 4.8

Let SR be a class of spanner representations. A complexity measure for SR is a recursive function \(c\colon \textsf{SR}\to \mathbb {N}\) such that for each Σ, the set of all ρ ∈ SR that represent spanners over Σ can be effectively enumerated in order of increasing c(ρ), and does not contain infinitely many ρ ∈ SR with the same value c(ρ).

By recursive, we mean a function that is total and computable. Definition 4.8 is general enough to include all notions of complexity that take into account that descriptions are commonly encoded with a finite number of distinct symbols, and that it should be decidable if a word over these symbols is a valid encoding from SR. Regardless of the chosen complexity measure, computable minimization of core spanners is impossible:

Theorem 4.9

Let c be a complexity measure for \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\). There is no algorithm that, given a \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), computes an equivalent \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) that is c-minimal.

Proof

Define \(U_{\min }\) to be the set of c-minimal core spanner representations of Υ. By the definition of a complexity measure, \(U_{\min }\) is finite. Hence, given a core spanner representation ρ, we can decide whether \(\rho \in U_{\min }\).

Now assume there is an algorithm \(\text{MIN}_{c}\) that minimizes core spanner representations with respect to c. Given a core spanner representation ρ, we can decide whether [[ρ]] = [[Υ]], by checking whether \(\text{MIN}_{c}(\rho )\in U_{\min }\). But as shown in Theorem 4.6, this problem is undecidable. Hence, \(\text{MIN}_{c}\) cannot exist. □

In addition to regex formulas, Fagin et al. [7] also define spanner representations that are based on so-called vset- and vstk-automata (denoted by \(\textsf{VA}_{\textsf{set}}\) and \(\textsf{VA}_{\textsf{stk}}\)). They show \([{\kern -2.3pt}[ \textsf{VA}_{\textsf{stk}} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \textsf{RGX} ]{\kern -2.3pt}]\) and \([{\kern -2.3pt}[ \textsf{VA}_{\textsf{set}} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}} ]{\kern -2.3pt}]\), and conclude that \([{\kern -2.3pt}[ \textsf{VA}_{\textsf{set}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \textsf{VA}_{\textsf{stk}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \textsf{RGX}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]\). Without going further into details, we note that their equivalence proofs use computable conversions between the models. Hence, Theorem 4.9 also applies to those spanner representations from [7] that can express core spanners, like \(\textsf{VA}_{\textsf{stk}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}}\) and \(\textsf{VA}_{\textsf{set}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}}\), and it implies that an algorithm that converts from one of these classes of representations to another cannot guarantee that its result is minimal.

Using a technique by Hartmanis [21], we can use the fact that CSp−Regularity is not co-semi-decidable to compare the relative succinctness of regular and core spanner representations:

Theorem 4.10

Let \(c_{1}\) and \(c_{2}\) be complexity measures for the classes \(\textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) and \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), respectively. For every recursive function \(f\colon \mathbb {N}\to \mathbb {N}\), there exists a \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) such that \([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\in [{\kern -2.3pt}[ \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}} ]{\kern -2.3pt}]\), but \(c_{1}(\hat {\rho })>f(c_{2}(\rho ))\) holds for every \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[{\hat {\rho }}]{\kern -2.3pt}]=[{\kern -2.3pt}[\rho ]{\kern -2.3pt}]\).

Proof

For the sake of a contradiction, assume that there exist complexity measures c1 for RGX{π, ∪, ⋈} and c2 for \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), as well as a recursive function f such that, for every core spanner representation \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with [[ρ]] ∈ [[RGX{π, ∪, ⋈}]], there exists a regular spanner representation \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\) and \(c_{1}(\hat {\rho })\leq f(c_{2}(\rho ))\). Our goal is to show that this implies that the set
$$\textsf{NR} := \{\rho\in\textsf{RGX}^{\{\pi,\zeta^=,\cup,\bowtie\}} \mid \text{there is no \(\rho_{R}\in \textsf{RGX}^{\{\pi,\cup,\bowtie\}}\) with \([{\kern-2.3pt}[ \rho ]{\kern-2.3pt}]=[{\kern-2.3pt}[ \rho_{R} ]{\kern-2.3pt}]\)}\} $$
is semi-decidable. As CSp−Regularity is not co-semi-decidable (Theorem 4.6), this will yield the desired contradiction.
We define a semi-decision procedure for NR as follows: Given a core spanner representation \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), compute a complexity bound \(n := f(c_{2}(\rho ))\). We define
$$F_{n}:=\{\rho_{R}\in \textsf{RGX}^{\{\pi,\cup,\bowtie\}}\mid c_{1}(\rho_{R})\leq n\}. $$
By Definition 4.8, the set \(F_{n}\) is finite, and we can effectively enumerate its elements \(\rho _{1},\ldots ,\rho _{k}\) for \(k := |F_{n}|\).

By our assumption on f and the definition of \(F_{n}\), we know that if there exists a \(\rho _{R}\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[ \rho _{R} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\), there exists a \(\hat {\rho }_{R}\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[ \hat {\rho }_{R} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\) and \(\hat {\rho }_{R}\in F_{n}\). In other words: If \([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\) is expressible with regular spanners, it is expressible with a regular spanner representation \(\hat {\rho }_{R}\) that satisfies the complexity bound n.

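For all \(\rho _{i}\in F_{n}\), we now semi-decide \([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\neq [{\kern -2.3pt}[ \rho _{i} ]{\kern -2.3pt}]\). In order to do this, we enumerate all \(w\in \Sigma ^{*}\). In each step, if \([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}](w)\neq [{\kern -2.3pt}[ \rho _{i} ]{\kern -2.3pt}](w)\) holds, we mark \(\rho _{i}\) as not equivalent to \(\rho \).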

If all spanner representations in \(F_{n}\) are marked, we know that no regular spanner representation \(\rho _{R}\) with \([{\kern -2.3pt}[ \rho _{R} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\) exists, and we output True. As \(F_{n}\) is finite, this point is reached after a finite number of steps if there is no such representation. On the other hand, if such a representation exists, the procedure will never terminate. Hence, we have defined a semi-decision procedure for NR, which implies that CSp−Regularity is co-semi-decidable, a contradiction to Theorem 4.6. □
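
To make the marking step of this procedure concrete, the following Python sketch models Boolean spanners as membership predicates on strings. It is only an illustration under stated assumptions: the names `words`, `semi_decide_not_regular`, `regex_predicate` and `is_square`, the short candidate list standing in for \(F_{n}\), and the bound `max_len` (added so that the sketch always terminates) are not part of the original proof, which enumerates all of \(\Sigma ^{*}\) without any bound.

```python
import re
from itertools import product


def words(alphabet, max_len):
    """Yield all words over `alphabet` of length <= max_len, shortest first."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def semi_decide_not_regular(rho, candidates, alphabet, max_len):
    """Return True once every candidate is marked as not equivalent to rho.

    Returns None if some candidate is still unmarked after all words up to
    max_len were tried (the procedure in the proof would keep running).
    """
    unmarked = set(range(len(candidates)))
    for w in words(alphabet, max_len):
        for i in list(unmarked):
            if candidates[i](w) != rho(w):  # w witnesses a difference
                unmarked.discard(i)
        if not unmarked:
            return True
    return None


def regex_predicate(pattern):
    """Hypothetical helper: a Boolean 'regular spanner' given by a regex."""
    compiled = re.compile(pattern)
    return lambda w: compiled.fullmatch(w) is not None


def is_square(w):
    """Toy stand-in for a Boolean core spanner: the copy language { uu }."""
    half = len(w) // 2
    return len(w) % 2 == 0 and w[:half] == w[half:]


# Toy stand-in for F_n (assumption: three hand-picked regular candidates).
toy_candidates = [regex_predicate(p) for p in ("(a|b)*", "((a|b)(a|b))*", "a*")]
all_marked = semi_decide_not_regular(is_square, toy_candidates, "ab", max_len=2)
still_open = semi_decide_not_regular(
    regex_predicate("a*"), toy_candidates, "ab", max_len=3
)
```

In this toy run, `all_marked` is `True` because the words `a` and `ab` separate the copy language from all three candidates, whereas `still_open` remains `None` because one candidate is equivalent to the input; this mirrors the non-terminating branch of the procedure.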

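Hence, the blowup from \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) to \(\textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) is not bounded by any recursive function. As above, we can replace each of these classes with a class of the same expressive power; for example, we can replace \(\textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) with \(\textsf{VA}_{\textsf{stk}}^{\{\pi ,\cup ,\bowtie \}}\), \(\textsf{VA}_{\textsf{set}}\), or \(\textsf{VA}_{\textsf{set}}^{\{\pi ,\cup ,\bowtie \}}\) (or, as the proof uses Boolean spanners, \(\textsf{RGX}\) or \(\textsf{VA}_{\textsf{stk}}\), or any class between those).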

We also consider the relative succinctness of representations of core spanners and representations of their complements. For every spanner P, we define its complement \(\textsf{C}(P) := {\Upsilon }_{\textsf{Vars}(P)}\setminus P\), and its hierarchical complement \(\textsf{C}^H(P):= {\Upsilon }^H_{\textsf{Vars}(P)}\setminus P\).

Theorem 4.11

Let c be a complexity measure for \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\). For every recursive function \(f\colon \mathbb {N}\to \mathbb {N}\), there exists a \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) such that
  1. \(\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\in [{\kern -2.3pt}[ \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}} ]{\kern -2.3pt}]\), but
  2. \(c(\rho )>f(c(\hat {\rho }))\) holds for every \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[\hat {\rho }]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[\rho ]{\kern -2.3pt}])\).

This also holds if we consider \(\textsf{C}^H\) instead of \(\textsf{C}\).

Proof

It suffices to prove the claim for Boolean core spanner representations (hence, we can focus on the case of \(\textsf{C}\), and do not need to consider \(\textsf{C}^H\) separately). For convenience, we define the set of all Boolean core spanner representations
$$\textsf{BCSR} :=\{\rho\in\textsf{RGX}^{\{\pi,\zeta^=,\cup,\bowtie\}}\mid {\textsf{SVars}\left( \rho\right)}=\emptyset\}. $$
As preparation for the actual proof, we consider the following sets of Boolean core spanner representations:
$$\begin{array}{@{}rcl@{}} \textsf{FIN}&:=&\{\rho\in\textsf{BCSR}\mid \mathcal{L}(\rho) \text{ is finite}\},\\ \textsf{COF}&:=& \{\rho\in\textsf{BCSR}\mid \mathcal{L}(\rho) \text{ is co-finite}\}. \end{array} $$
This proof heavily relies on various sets from the first two levels of the arithmetical hierarchy (cf. Kozen [28]). Without going into further details, note that \({\Sigma ^{0}_{1}}\) is the family of all sets that are semi-decidable (recursively enumerable), \({\Pi ^{0}_{1}}\) is the family of all sets that are co-semi-decidable (co-recursively enumerable), and \({\Delta ^{0}_{1}}={\Sigma ^{0}_{1}}\cap {\Pi ^{0}_{1}}\) is the family of all sets that are decidable.

Regarding the next level, \({\Sigma ^{0}_{2}}\) is the family of all sets that are semi-decidable when using oracles for sets in \({\Sigma ^{0}_{1}}\) (or in \({\Pi ^{0}_{1}}\)), and \({\Pi ^{0}_{2}}\) is the family of all sets that are co-semi-decidable when using such oracles. Finally, \({\Delta ^{0}_{2}}={\Sigma ^{0}_{2}}\cap {\Pi ^{0}_{2}}\) is the family of all sets that are decidable when using oracles for sets in \({\Sigma ^{0}_{1}}\) or in \({\Pi ^{0}_{1}}\).

A central part of our reasoning in this proof is the following observation:

Claim 1

\({\textsf{COF}\not \in \Delta ^{0}_{2}}\).

Proof

As shown in Freydenberger [12], the xregexes that we used in the proof of Theorem 4.6 also prove that co-finiteness for vstar-free xregexes is \({\Sigma ^{0}_{2}}\)-complete.

Hence, the proof of Theorem 4.6 also implies that COF is \({\Sigma ^{0}_{2}}\)-hard. This immediately implies \({\textsf{COF}\notin \Delta ^{0}_{2}}\); as otherwise, \({\Sigma ^{0}_{2}}={\Delta ^{0}_{2}}\) would hold, which contradicts the fact that the arithmetical hierarchy is a proper hierarchy. \(\square \) (Claim 1)

Our goal is to use Claim 1 to obtain the contradiction on which this proof rests. More precisely, we shall prove that any recursive bound on the size of the core spanner for a complement can be used to prove \({\textsf{COF}\in \Delta ^{0}_{2}}\). One of the central parts of our reasoning shall be the following result.

Claim 2

\({\textsf{FIN}\in \Sigma ^{0}_{1}}\).

Proof

We give the following semi-decision procedure for FIN. Let \(\rho \in \textsf{BCSR}\). Enumerate all finite sets \(S\subseteq \Sigma ^{*}\). For each set, we check the following two conditions:
  1. \(S\subseteq \mathcal {L}(\rho )\)
  2. \(\mathcal {L}(\rho )\cap (\Sigma ^{*}\setminus S)=\emptyset \)

Note that both conditions are decidable: As S is finite, the first condition can be checked by deciding whether \(w\in \mathcal {L}(\rho )\) holds for each \(w\in S\).

For the second condition, we first construct a regular expression α with \(\mathcal {L}(\alpha )= (\Sigma ^{*}\setminus S)\). Then, we define the Boolean core spanner representation \(\rho _{S} := \alpha \bowtie \rho \). As \(\mathcal {L}(\rho _{S})=\mathcal {L}(\alpha )\cap \mathcal {L}(\rho )=(\Sigma ^{*}\setminus S)\cap \mathcal {L}(\rho )\), we can decide the second condition by checking whether \(\mathcal {L}(\rho _{S})=\emptyset \) holds (which is decidable, according to Theorem 4.4).

If S satisfies both conditions, \(S=\mathcal {L}(\rho )\) holds. Hence, \(\mathcal {L}(\rho )\) is finite, and the semi-decision procedure returns True. Furthermore, for every \(\rho \in \textsf{FIN}\), the procedure will (after a finite number of enumerated finite sets) check the set \(S=\mathcal {L}(\rho )\), and then return True. Thus, FIN is semi-decidable, which is equivalent to \({\textsf{FIN}\in \Sigma ^{0}_{1}}\). \(\square \) (Claim 2)
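
The enumeration used in this procedure can be sketched in Python as follows. The sketch is illustrative only: `nth_word`, `nth_finite_set` and `semi_decide_finite` are hypothetical names, the emptiness test for \(\mathcal {L}(\rho )\cap (\Sigma ^{*}\setminus S)\) is passed in as a callable `empty_outside` (standing in for the decision procedure that the proof obtains from Theorem 4.4), and `max_steps` is an artificial cutoff so that the code also terminates on infinite languages. Finite sets are enumerated by reading the binary digits of a counter as a characteristic vector over the length-lexicographic enumeration of \(\Sigma ^{*}\).

```python
def nth_word(i, alphabet):
    """Return the i-th word over `alphabet` in length-lexicographic order."""
    k = len(alphabet)
    chars = []
    while i > 0:  # bijective base-k numeration
        i -= 1
        chars.append(alphabet[i % k])
        i //= k
    return "".join(reversed(chars))


def nth_finite_set(m, alphabet):
    """Decode the integer m as a finite set: bit i set <=> i-th word in S."""
    return {nth_word(i, alphabet) for i in range(m.bit_length()) if (m >> i) & 1}


def semi_decide_finite(member, empty_outside, alphabet, max_steps):
    """Return True if some enumerated finite set S equals the language.

    member(w)        -- decides w in L(rho)
    empty_outside(S) -- decides L(rho) & (Sigma* - S) == {} (assumed oracle)
    Returns None if no such S is found within max_steps candidate sets.
    """
    for m in range(max_steps):
        s = nth_finite_set(m, alphabet)
        if all(member(w) for w in s) and empty_outside(s):
            return True
    return None


# Toy inputs (assumptions): a finite language and an infinite one.
finite_lang = {"a", "ab"}
found = semi_decide_finite(
    finite_lang.__contains__, lambda s: finite_lang <= s, "ab", max_steps=32
)
not_found = semi_decide_finite(
    lambda w: w.startswith("a"), lambda s: False, "ab", max_steps=32
)
```

For the finite toy language the procedure succeeds at the counter value that encodes exactly `{"a", "ab"}`; for the infinite one the second condition never holds, so the sketch gives up at the cutoff, where the procedure from the proof would run forever.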

The next observation is not very deep; but in order to streamline the flow of our later reasoning, we state it as a separate claim.

Claim 3

For every \(\rho \in \textsf{BCSR}\), we have that \(\rho \in \textsf{COF}\) holds if and only if there is a \(\hat {\rho }\in \textsf{FIN}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\).

Proof

Let \(\rho \in \textsf{BCSR}\). We begin with the if-direction. Assume there exists a \(\hat {\rho }\in \textsf{FIN}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\). As \(\hat {\rho }\in \textsf{FIN}\), the language \(\mathcal {L}(\hat {\rho })\) is finite, which implies that \(\mathcal {L}(\rho )=\Sigma ^{*}\setminus \mathcal {L}(\hat {\rho })\) is co-finite. Hence, \(\rho \in \textsf{COF}\).

For the only-if direction, let \(\rho \in \textsf{COF}\); i.e., \(\mathcal {L}(\rho )\) is co-finite. Hence, \(\Sigma ^{*}\setminus \mathcal {L}(\rho )\) is finite, and therefore regular. Thus, there exists a proper regular expression \(\hat {\rho }\) with \(\mathcal {L}(\hat {\rho })=\Sigma ^{*}\setminus \mathcal {L}(\rho )\). As every proper regular expression is also a functional regex formula with no variables (and, hence, Boolean), \(\hat {\rho }\in \textsf{BCSR}\) follows. This gives \(\hat {\rho }\in \textsf{FIN}\), while \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\) holds by our choice of \(\hat {\rho }\). \(\square \) (Claim 3)
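
The only-if direction relies on the fact that every finite language is regular. As a small illustration (not part of the original proof), the Python sketch below builds such an expression as an alternation of the finitely many missing words. Here `finite_language_regex` is a hypothetical helper, the set `missing` is a made-up example, and Python's `re` syntax serves only as a convenient notation for proper regular expressions; the pattern `(?!)` is a Python idiom used to denote the empty language.

```python
import re
from itertools import product


def finite_language_regex(words):
    """Return a regex whose full matches are exactly the given finite set."""
    if not words:
        return "(?!)"  # empty negative lookahead: matches no word at all
    return "(?:" + "|".join(re.escape(w) for w in sorted(words)) + ")"


# Toy co-finite language over {a, b}: everything except "", "b" and "ba".
missing = {"", "b", "ba"}
complement_regex = finite_language_regex(missing)

# Sanity check on all words up to length 3: the regex accepts exactly `missing`.
agrees = all(
    (re.fullmatch(complement_regex, w) is not None) == (w in missing)
    for n in range(4)
    for w in ("".join(t) for t in product("ab", repeat=n))
)
```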

We now proceed to the main part of the proof, which uses these claims. Let c be a complexity measure for the class \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\). Assume that there exists a recursive function \(f\colon \mathbb {N}\to \mathbb {N}\) such that for all \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) for which C([[ρ]]) is a core spanner, there exists a \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\) and \(c(\rho )\leq f(c(\hat {\rho }))\).

Our goal is to show that this assumption implies that COF is in \({\Delta ^{0}_{2}}\). We prove this by defining a decision procedure with oracles for \({\Sigma ^{0}_{1}}\) and \({\Pi ^{0}_{1}}\) on the input \(\rho \in \textsf{BCSR}\) as follows. First, compute \(n := f(c(\rho ))\), and let
$$R_{n} := \{\hat{\rho}\in\textsf{BCSR} \mid c(\hat{\rho})\leq n\}. $$
From Claim 3, we know that \(\rho \in \textsf{COF}\) if and only if there is a \(\hat {\rho }\in \textsf{FIN}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\). Due to our assumption on f, this holds if and only if such a \(\hat {\rho }\) exists in \(R_{n}\).
We now check for each \(\hat {\rho }\in R_{n}\) whether it satisfies these two criteria:
  1. \(\hat {\rho }\in \textsf{FIN}\)
  2. \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\)

Due to Claim 2, we know that FIN is in \({\Sigma ^{0}_{1}}\). Hence, the first criterion can be decided with a \({\Sigma ^{0}_{1}}\)-oracle.

Regarding the second criterion, note that \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\neq \textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\) is semi-decidable (as it suffices to find a \(w\in \Sigma ^{*}\) that disproves the equality). Hence, this criterion is co-semi-decidable, which means that it can be decided with a \({\Pi ^{0}_{1}}\)-oracle.

If there exists a \(\hat {\rho }\in R_{n}\) that satisfies both criteria, the procedure returns True. In this case, \(\rho \in \textsf{COF}\) holds by Claim 3; hence, this is correct.

If no such \(\hat {\rho }\) can be found among the (finitely many) elements of \(R_{n}\), the procedure returns False. As mentioned above, this is correct due to our assumptions on f.

As COF can be decided by using oracles for \({\Sigma ^{0}_{1}}\) and \({\Pi ^{0}_{1}}\), we know that \(\textsf{COF}\in {\Delta ^{0}_{2}}\) must hold. This contradicts Claim 1. As our only assumption was the existence of the recursive bound f, no such bound can exist. □
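
As a compact summary of the proof of Theorem 4.11 (merely restating its steps, with \(n = f(c(\rho ))\) and \(R_{n}\) as defined there), the decision procedure evaluates the right-hand side of the following chain of equivalences:

$$\rho\in\textsf{COF} \;\Longleftrightarrow\; \exists\,\hat{\rho}\in\textsf{FIN}\colon [{\kern-2.3pt}[ \hat{\rho} ]{\kern-2.3pt}]=\textsf{C}([{\kern-2.3pt}[ \rho ]{\kern-2.3pt}]) \;\Longleftrightarrow\; \bigvee_{\hat{\rho}\in R_{n}} \Bigl(\underbrace{\hat{\rho}\in\textsf{FIN}}_{\text{in } {\Sigma^{0}_{1}}} \;\wedge\; \underbrace{[{\kern-2.3pt}[ \hat{\rho} ]{\kern-2.3pt}]=\textsf{C}([{\kern-2.3pt}[ \rho ]{\kern-2.3pt}])}_{\text{in } {\Pi^{0}_{1}}}\Bigr). $$

The first equivalence is Claim 3, and the second one is where the assumed recursive bound f enters. As \(R_{n}\) is finite, the disjunction can be evaluated with finitely many queries to the two oracles.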

In other words, there are core spanners where the (hierarchical) complement is also a core spanner, but the blowup between their representations is not bounded by any recursive function. Again, this holds for the other classes of representations as well.

This result has consequences for an open question of Fagin et al. One of the central tools in [7] is the core-simplification lemma, which states that every core spanner is definable by an expression of the form \(\pi _{V}SA\), where A is a vset-automaton, \(V\subseteq \textsf{SVars}(A)\), and S is a sequence of selections \(\zeta ^=_{x,y}\) for \(x,y\in \textsf{SVars}(A)\).

In addition to core spanners, Fagin et al. also discuss adding a set difference operator ∖, and ask “whether we can find a simple form, in the spirit of the core-simplification lemma, when adding difference to the representation of core spanners”. It is a direct consequence of Theorem 4.11 that such a simple representation, if it exists, cannot be obtained effectively, as reducing the number of difference operators can lead to a non-recursive blowup. While this observation does not prove that such a simple form does not exist, it suggests that any proof of its existence should be expected to be non-constructive.

5 Conclusions and Further Work

In Section 3, we have seen that core spanners can express languages that are defined by patterns or by vstar-free xregexes. We used this in Section 4 to derive various lower bounds on decision problems, even for subclasses of core spanner representations. Note that in most of the cases, these lower bounds do not require the join operator, and mostly rely on the string equality selection. This can be interpreted as a sign that string equality (or repetition) is an expensive operator, in particular as similar results have been observed for related models (e.g., [2, 12, 16]). On the other hand, Proposition 4.2 demonstrates that even without string equality, join is also an expensive operator. The authors take this as a sign that the search for good restrictions on core spanners will probably have to combine restrictions on string equality and on join.

There is also reason to hope that the connections to patterns and word equations can be beneficial for spanners: There is recent work on restricted classes of pattern languages with an efficient membership problem (e.g., [10, 33]), which could lead to subclasses of spanners that can be evaluated more efficiently. Furthermore, as Theorems 3.12 and 3.13 show, core spanners and word equations with regular constraints are closely related. Recent work on word equations has also considered tasks like enumerating all solutions of an equation. The employed compression techniques (cf. [6]) might also be used to improve the evaluation of core spanners. In particular, the \(\textsf{EC}^{\textsf{reg}}\)-formulas that are constructed in the proof of Theorem 3.12 have the special property that there is a variable \(x_{w}\) (for w), and for every solution σ and every variable x, we have that \(\sigma (x)\) is a subword of \(\sigma (x_{w})\).

Freydenberger [13] builds on this observation and introduces a fragment of \(\textsf{EC}^{\textsf{reg}}\) that has exactly the same expressive power as core spanners. The connection is even stronger: As shown in [13], there exist polynomial time conversions between this fragment and core spanner representations. It remains to be seen whether the connection between spanners and word equations can also be used to find interesting subclasses of core spanners that have friendlier upper bounds (in particular regarding evaluation).

Also note that the conversion from vstar-free regular expressions to core spanner representations that is used for Theorem 3.21 can lead to an exponential increase in size. As shown in [13], this blowup can be avoided by using a more involved construction.

Finally, while we only mentioned this explicitly in Section 4.2.1, note that most of the other results in this paper can also be directly converted to the appropriate spanner representations that use vset- and vstk-automata from [7].

Footnotes

  1. Following the terminology of [3]; literature also uses the term rational constraints.

Notes

Acknowledgements

We thank Florin Manea for his suggestion to use word equations with regular constraints, and Thomas Zeume for reporting a list of typos. We also thank the anonymous reviewers of both this paper and the conference version for their feedback.

References

  1. Angluin, D.: Finding patterns common to a set of strings. J. Comput. Syst. Sci. 21, 46–62 (1980)
  2. Barceló, P., Libkin, L., Lin, A.W., Wood, P.T.: Expressive languages for path queries over graph-structured data. ACM Trans. Database Syst. 37(4), 31 (2012)
  3. Barceló, P., Muñoz, P.: Graph Logics with Rational Relations: The Role of Word Combinatorics. In: Proc. CSL-LICS 2014 (2014)
  4. Bremer, J., Freydenberger, D.D.: Inclusion problems for patterns with a bounded number of variables. Inform. Comput. 220–221, 15–43 (2012)
  5. Câmpeanu, C., Salomaa, K., Yu, S.: A formal study of practical regular expressions. Int. J. Found. Comput. Sci. 14, 1007–1018 (2003)
  6. Diekert, V.: Makanin’s Algorithm. In: Lothaire, M. (ed.) Algebraic Combinatorics on Words, chapter 12, pages 387–442. Cambridge University Press (2002)
  7. Fagin, R., Kimelfeld, B., Reiss, F., Vansummeren, S.: Document spanners: A formal approach to information extraction. J. ACM 62(2), 12 (2015)
  8. Fagin, R., Kimelfeld, B., Reiss, F., Vansummeren, S.: Declarative cleaning of inconsistencies in information extraction. ACM Trans. Database Syst. 41(1), 6 (2016)
  9. Fernau, H., Manea, F., Mercas, R., Schmid, M.L.: Pattern Matching with Variables: Fast Algorithms and New Hardness Results. In: Proc. STACS 2015 (2015)
  10. Fernau, H., Schmid, M.L.: Pattern matching with variables: A multivariate complexity analysis. Inf. Comput. 242, 287–305 (2015)
  11. Fernau, H., Schmid, M.L., Villanger, Y.: On the parameterised complexity of string morphism problems. Theory Comput. Sys. (2015)
  12. Freydenberger, D.D.: Extended regular expressions: Succinctness and decidability. Theory Comput. Sys. 53(2), 159–193 (2013)
  13. Freydenberger, D.D.: A Logic for Document Spanners. In: Proc. ICDT. Accepted (2017)
  14. Freydenberger, D.D., Holldack, M.: Document Spanners: From Expressive Power to Decision Problems. In: Proc. ICDT 2016 (2016)
  15. Freydenberger, D.D., Reidenbach, D.: Bad news on decision problems for patterns. Inform. Comput. 208(1), 83–96 (2010)
  16. Freydenberger, D.D., Schweikardt, N.: Expressiveness and static analysis of extended conjunctive regular path queries. J. Comput. Syst. Sci. 79(6), 892–909 (2013)
  17. Friedl, J.E.F.: Mastering Regular Expressions. O’Reilly Media. 3rd edition (2006)
  18. Garey, M.R., Johnson, D.S.: Computers and intractability. W. H. Freeman and Company (1979)
  19. Ginsburg, S., Spanier, E.: Semigroups, Presburger formulas, and languages. Pac. J. Math. 16(2), 285–296 (1966)
  20. Grohe, M., Flum, J.: Parameterized complexity theory. Texts in Theoretical Computer Science. Springer (2006)
  21. Hartmanis, J.: On Gödel speed-up and succinctness of language representations. Theor. Comput. Sci. 26(3), 335–342 (1983)
  22. Holzer, M., Kutrib, M.: Descriptional complexity–an introductory survey. Sci. Appl. Language Methods 2, 1–58 (2010)
  23. Ibarra, O.H., Pong, T.-C., Sohn, S.M.: A note on parsing pattern languages. Pattern Recogn. Lett. 16(2), 179–182 (1995)
  24. Jiang, T., Kinber, E., Salomaa, A., Salomaa, K., Yu, S.: Pattern languages with and without erasing. Int. J. Comput. Math. 50, 147–163 (1994)
  25. Jiang, T., Salomaa, A., Salomaa, K., Yu, S.: Decision problems for patterns. J. Comput. Syst. Sci. 50, 53–63 (1995)
  26. Karhumäki, J., Mignosi, F., Plandowski, W.: The expressibility of languages and relations by word equations. J. ACM 47(3), 483–505 (2000)
  27. Kozen, D.: Lower Bounds for Natural Proof Systems. In: Proc. FOCS 1977 (1977)
  28. Kozen, D.: Theory of computation. Springer-Verlag (2006)
  29. Kutrib, M.: The phenomenon of non-recursive trade-offs. Int. J. Found. Comput. Sci. 16(5), 957–973 (2005)
  30. Lothaire, M.: Combinatorics on Words. Cambridge University Press (1997)
  31. Ohlebusch, E., Ukkonen, E.: On the equivalence problem for E-pattern languages. Theor. Comput. Sci. 186, 231–248 (1997)
  32. Parikh, R.J.: On context-free languages. J. ACM 13(4), 570–581 (1966)
  33. Reidenbach, D., Schmid, M.L.: Patterns with bounded treewidth. Inform. Comput. 239, 87–99 (2014)
  34. Stephan, F., Yoshinaka, R., Zeugmann, T.: On the Parameterised Complexity of Learning Patterns. In: Proc. ISCIS 2011 (2011)

Copyright information

© The Author(s) 2017

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors and Affiliations

  1. Loughborough University, Loughborough, UK
  2. Goethe University, Frankfurt am Main, Germany
