Document Spanners: From Expressive Power to Decision Problems

Freydenberger, Dominik D.; Holldack, Mario

doi:10.1007/s00224-017-9770-0

Open access
Published: 22 May 2017

Volume 62, pages 854–898, (2018)
Theory of Computing Systems

Abstract

We examine document spanners, a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners that are defined by regex formulas, which are basically regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend the former by standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to three models – namely, patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.

1 Introduction

Information Extraction (IE) is the task of automatically extracting structured information from texts. This paper examines document spanners (also called spanners), a formalization of the IE query language AQL, which is used in IBM’s SystemT. Document spanners were introduced by Fagin et al. [7] in order to allow the theoretical examination of AQL, and were also used in [8].

A span is an interval on positions of a string w, and a spanner is a function that maps w to a relation over spans of w. A central topic of [7] and of the present paper are core spanners (according to Fagin et al., this name was chosen because core spanners capture the core of AQL).

The primitive building blocks of core spanners are regex formulas, which are regular expressions with variables. Each of these variables corresponds to a subexpression, and whenever a regex formula α matches a string w, each variable is mapped to the span in w that matches that subexpression. For example, consider the regex formula α := x{aaa} ⋅ a ⁺ ⋅ y{a ⁺}, with terminal a, and variables x and y. When α matches a string w, it maps x to the span that contains the first three positions of w, and y to a span from some position after the third to the last position of w. Hence, each match of α on w determines a tuple of spans; and as there can be multiple matches of a regex formula to a string, this process creates a relation over spans of w. Core spanners are then defined by extending regex formulas with the relational operations projection, union, natural join, and string equality selection.

One of the two main topics of the present paper is the examination of decision problems for core spanners, in particular evaluation and static analysis. These results are mostly derived from the other main topic, the examination of the expressive power of core spanners in relation to three other models that use repetition operators, which act similarly to the spanners’ string equality selection.

We begin by comparing core spanners to patterns. A pattern is a word that consists of variables and terminals, and generates the language of all words that can be obtained by substitution of the variables with arbitrary terminal words. For example, the pattern α = x x ab y (where x and y are variables, and a and b are terminals) generates the language of all words that have a prefix that consists of a square, followed by the word ab. Although pattern languages have a simple definition, various decision problems for them are surprisingly hard. For example, their membership problem is NP-complete (cf. Angluin [1], Jiang et al. [24]), and their inclusion problem is undecidable (cf. Bremer and Freydenberger [4]). As we show that core spanners can recognize pattern languages, this allows us to conclude that evaluation of Boolean core spanners is NP-hard, and that spanner containment is undecidable.

Next, we consider word equations, which are equations of the form α = β, where α and β are patterns. Word equations can be used to define languages and word relations. We show that word equations with regular constraints can express all relations that are expressible with core spanners. By using an improved version of Makanin’s algorithm (cf. Diekert [6]), this allows us to show that satisfiability and hierarchicality for core spanners can be decided in PSPACE. Moreover, using coding techniques from word equations, we show that two common relations from combinatorics on words can be selected with core spanners.

Finally, we examine the relation of core spanners to xregexes (also called extended regular expressions, regexes, or regular expressions with backreferences in the literature). These are regular expressions that can use a repetition operator, which is available in most modern implementations of regular expressions (see, e. g., Friedl [17]) and which allows the definition of non-regular languages. For example, the xregex x{ Σ ^∗} ⋅ &x ⋅ &x generates all cubic words over Σ , as x{ Σ ^∗} generates some word w which is stored in the variable x, and each occurrence of &x repeats that w. As a consequence of this increase in expressive power, many decision problems are harder for xregexes than for their “classical” counterparts. In particular, various problems of static analysis are undecidable (Freydenberger [12]).

But as shown by Fagin et al. [7], core spanners cannot define all languages that are definable by xregexes. Intuitively, the reason for this is that xregexes can use their repetition operators inside a Kleene star, which allows them to repeat an arbitrary word an unbounded number of times – for example, the xregex x{ Σ ^∗}⋅&x ⁺ generates the language of all w ⁿ, n ≥ 2. In contrast to this, core spanners have to express repetitions with variables and string equality selections. Inspired by this observation, we introduce variable-star free (or vstar-free) xregexes as those xregexes that neither define nor use variables inside a Kleene star. We show that every vstar-free xregex can be converted into an equivalent core spanner. Since all undecidability results by Freydenberger [12] also apply to vstar-free xregexes, these undecidability results carry over to core spanners. This also has various consequences for the minimization and the relative succinctness of classes of spanner representations. We also show that complementing a core spanner can lead to a size increase that is not bounded by any recursive function (for basically all natural notions of size). Although this does not solve an open problem by Fagin et al. [7] on the simplification of core spanners with difference operators, it shows that if simplification is possible, it has to be non-computable. As a further contribution, we also develop tools to prove inexpressibility for vstar-free regular expressions and for core spanners.

As we shall see, many of the observed lower bounds hold even for comparatively restricted classes of core spanners (in particular, most of the results hold for spanners that do not use join). Hence, the authors consider it reasonable to expect that these results can be easily adapted to other information extraction languages that combine regular expressions with capture variables and a string equality operator.

In addition to regex formulas, Fagin et al. [7] also consider two types of automata as basic building blocks of spanner representations. While the present paper does not discuss these in detail, most of the results on spanner representations that are based on regex formulas can be directly converted to the respective class of spanner representations that are based on automata.

Related Work

For an overview of related models, we refer to Fagin et al. [7]. In addition to this, we highlight connections to models with similar properties. In [7], Fagin et al. showed that there is a language that can be defined by xregexes, but not by core spanners. Furthermore, they compared the expressive power of core spanners and a variant of conjunctive regular path queries (CRPQs), a graph querying language. Barceló et al. [2] introduced extended CRPQs (ECRPQs), which can compare paths in the graph with regular relations. While there is no direct connection between ECRPQs and core spanners, both models share the basic idea of combining regular languages with a comparison operator that can express string equality. As shown by Freydenberger and Schweikardt [16], ECRPQs have undecidability results that are comparable to those in the present paper, and to those for xregexes (cf. Freydenberger [12]). Furthermore, Barceló and Muñoz [3] have used word equations with regular constraints for variants of CRPQs.

Also note that Freydenberger [13] extends the results on the connection between word equations and core spanners from the present paper into a logic on words that has the same expressive power as core spanners.

Structure of the Paper

In Section 2, we give definitions of xregexes and of core spanners. Section 3 compares the expressive power of core spanners to patterns, word equations, and vstar-free regular expressions. The results from this section are then used in Section 4 to examine the complexity of evaluation and static analysis of spanners. We also examine the consequences of these results for the relative succinctness of different spanner representations. Section 5 concludes the paper.

2 Preliminaries

Let $\mathbb {N}$ and $\mathbb {N}_{>0}$ be the sets of non-negative and positive integers, respectively. Let Σ be a fixed finite alphabet of (terminal) symbols. Except when stated otherwise, we assume | Σ | ≥ 2. We use ε to denote the empty word. For every word w ∈ Σ ^∗ and every a ∈ Σ , let |w| denote the length of w, and |w|_a the number of occurrences of a in w. A word x ∈ Σ ^∗ is a subword of a word y ∈ Σ ^∗ if there exist u, v ∈ Σ ^∗ with y = u x v . A word x ∈ Σ ^∗ is a prefix of a word y ∈ Σ ^∗ if there exists a v ∈ Σ ^∗ with y = x v , and a proper prefix if it is a prefix and x ≠ y. For every $n\in \mathbb {N}$, an n-ary word relation (over Σ) is a subset of (Σ ^∗)ⁿ.

2.1 Xregexes (Extended Regular Expressions)

This section introduces the syntax and semantics of xregexes, which we shall also use for regex formulas in Section 2.2. We begin with the syntax, which follows the definition from [7].

Definition 2.1

We fix an infinite set X of variables and define the set M of meta symbols as M := {ε, ∅, (,), {,}, ⋅, ∨, ^∗, &}. Let Σ , X, and M be pairwise disjoint. The set of xregexes (extended regular expressions) is defined as follows:

1.
The symbols ∅ and ε, and every a ∈ Σ are xregexes.
2.
If α ₁ and α ₂ are xregexes, then (α ₁ ⋅ α ₂) (concatenation), (α ₁ ∨ α ₂) (disjunction), and $(\alpha _{1}^{*})$ (Kleene star) are xregexes.
3.
For every x ∈ X and every xregex α that contains neither x{⋯} nor &x as a subword, x{α} is an xregex (variable binding).
4.
For every x ∈ X, we have that &x is an xregex (variable reference).

If a subword β of an xregex α is an xregex itself, we call β a subexpression (of α). The set of all subexpressions of α is denoted by Sub(α), and the set of variables occurring in variable bindings in an xregex α is denoted by Vars(α). If an xregex α contains neither variable references nor variable bindings, we call α a proper regular expression.

In other words, we use the term “proper” to distinguish those expressions that are usually just called “regular expressions” from the more general extended regular expressions. We use the notation α ⁺ as a shorthand for α ⋅ α ^∗. Parentheses can be added freely. We may also omit parentheses and the concatenation operator, where we assume that ∗ and + take precedence over concatenation, and concatenation precedes disjunction. Furthermore, we use Σ as a shorthand for the regular expression $\bigvee _{a\in \Sigma } a$.

Before introducing the semantics of xregexes formally, we give an intuitive explanation. An expression of the form α = x{β} matches the same strings as β, but α additionally stores the matched string in the variable x. Using a variable reference &x, this string can then be repeated. For example, let α := (x{ Σ ^∗} ⋅ &x). The subexpression x{ Σ ^∗} matches any string w ∈ Σ ^∗ and stores this match in x. The following variable reference &x repeats the stored w. Thus, α defines the (non-regular) copy-language {w w∣w ∈ Σ ^∗}.
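The copy-language example can be tried out with a backreference in a practical regex engine. The following is a minimal sketch, assuming Python's `re` module as a stand-in for xregexes: the capture group `(.*)` plays the role of the binding x{Σ ^∗} and `\1` plays the role of &x. The function name `in_copy_language` is hypothetical, and the engine's semantics may differ from Definition 2.2 in corner cases (for example, references to unbound variables).

```python
import re

# (.*) corresponds to the binding x{Σ*}; \1 corresponds to the reference &x.
# re.DOTALL lets "." match every character, so Σ is "all characters" here.
COPY = re.compile(r"(.*)\1", re.DOTALL)


def in_copy_language(w: str) -> bool:
    """Return True if w = uu for some word u (illustrative helper)."""
    return COPY.fullmatch(w) is not None


if __name__ == "__main__":
    for word in ["", "aa", "abab", "aba", "ab"]:
        print(repr(word), in_copy_language(word))
```

Note that the backtracking engine tries every possible length for the captured word, which mirrors the fact that a parse tree may bind x to any prefix of the input.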

The following definition of the semantics of xregexes is based on the semantics by Freydenberger [12], which is an adaption of the semantics from Câmpeanu et al. [5] (the former uses variables, the latter backreferences). In comparison to [12], the case for Kleene star has been changed, in order to make the definition compatible with the parse trees for regex formulas from Fagin et al. [7].

Definition 2.2

Let γ be an xregex over Σ and X. A γ-parse tree is a finite, directed, and ordered tree T _γ. Its nodes are labeled with tuples of the form (w, γ′) ∈ ( Σ ^∗ × Sub(γ)). The root of every γ-parse tree T _γ is labeled (w, γ) with w ∈ Σ ^∗; and the following rules must hold for each node v of T _γ:

1)
If v is labeled (w, a) with a ∈ ( Σ ∪ {ε}), then v is a leaf, and w = a.
2)
If v is labeled (w, (β ₁ ⋅ β ₂)), then v has exactly one left child v ₁ and exactly one right child v ₂ with respective labels (w ₁, β ₁) and (w ₂, β ₂), and w = w ₁ w ₂.
3)
If v is labeled (w, (β ₁ ∨ β ₂)), then v has a single child, labeled (w, β ₁) or (w, β ₂).
4)
If v is labeled (w, β ^∗), then one of the following cases holds: (a) w = ε, and v is a leaf, or (b) w = w ₁ w ₂…w _k for words w ₁, …, w _k ∈ Σ ⁺ (with k ≥ 1), and v has k children v ₁, …, v _k (ordered from left to right) that are labeled (w ₁, β), …, (w _k, β).
5)
If v is labeled (w, x{β}), then v has a single child, labeled (w, β).
6)
If v is labeled (w, &x), let ≺ denote the post-order of the nodes of T _γ (that results from a left-to-right, depth-first traversal). Then one of the following cases applies: (a) If there is no node v′ with v′ ≺ v that is labeled (w′, x{β′}) ∈ Σ ^∗ × Sub(γ), then v is a leaf, and w = ε. (b) Otherwise, let v′ be the node with v′ ≺ v that is ≺-maximal among nodes labeled (w′, x{β′}). Then v is a leaf, and w = w′.

If the root of a γ-parse tree T _γ is labeled (w, γ), we call T _γ a γ -parse tree for w. If the context is clear, we omit γ and call T _γ a parse tree.

There is no parse tree for ∅, and references to unbound variables (i. e., variables that were not assigned a value with a variable binding operator) default to ε. For an example of a parse tree, see Fig. 1.

We use parse trees to define the semantics of xregexes:

Definition 2.3

An xregex γ recognizes the language $\mathcal {L}(\gamma )$ of all w ∈ Σ ^∗ for which there exists a γ-parse tree T _γ with (w, γ) as root label.

Example 2.4

Consider the xregexes α := x{Σ ⁺}⋅(&x)⁺, β := x{Σ ⁺}⋅&x ⋅ x{Σ ⁺}⋅&x, and γ := x{a a ⁺}⋅(&x)⁺ for some a ∈ Σ.

Then $\mathcal {L}(\alpha )=\{w^{n}\mid w\in \Sigma ^{+}, n\geq 2\}$, $\mathcal {L}(\beta )=\{x_{1}x_{1}x_{2}x_{2}\mid x_{1},x_{2}\in \Sigma ^{+}\}$ , and $\mathcal {L}(\gamma )=\{a^{n}\mid n\geq 2, \text {\textit {n} is not prime}\}.$
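The claim about γ can be checked experimentally for small n. The sketch below again assumes Python's `re` module as a stand-in for xregexes, with `(aa+)` for x{a a ⁺} and `\1+` for (&x)⁺; the helper names `gamma_accepts` and `is_composite` are hypothetical. It compares the regex against a direct test for "n ≥ 2 and n is not prime".

```python
import re

# (aa+) corresponds to x{a a+}; \1+ corresponds to (&x)+.
GAMMA = re.compile(r"(aa+)\1+")


def gamma_accepts(n: int) -> bool:
    """Return True if a^n is matched by the regex that mimics γ."""
    return GAMMA.fullmatch("a" * n) is not None


def is_composite(n: int) -> bool:
    """Return True if n has a divisor d with 1 < d < n."""
    return n >= 4 and any(n % d == 0 for d in range(2, n))


if __name__ == "__main__":
    accepted = [n for n in range(0, 31) if gamma_accepts(n)]
    print(accepted)
```

A word a ⁿ is accepted exactly if n = m ⋅ k with m ≥ 2 (the length bound to x) and k ≥ 2 (the total number of copies), which is why the accepted lengths are the composite numbers.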

2.2 Document Spanners

Let w := a ₁ a ₂⋯a _n be a word over Σ, with $n\in \mathbb {N}$ and a ₁, …, a _n ∈ Σ. A span of w is an interval [i, j〉 with 1 ≤ i ≤ j ≤ n + 1 and $i,j \in \mathbb {N}$. For each span [i, j〉 of w, we define a subword w _[i,j〉 := a _i⋯a _{j−1}. In other words, each span describes a subword of w by its bounding indices. Two spans [i, j〉 and [i′, j′〉 of w are equal if and only if i = i′ and j = j′. These spans overlap if i ≤ i′ < j or i′ ≤ i < j′, and are disjoint, otherwise. The span [i, j〉 contains the span [i′, j′〉 if i ≤ i′ ≤ j′ ≤ j. The set of all spans of w is denoted by Spans(w).

Example 2.5

Let w := aabbcabaa. As |w| = 9, both [1, 3〉 and [8, 10〉 are spans of w, but [10, 11〉 is not. Although w _[1,3〉 = w _[8,10〉 = aa, the first two spans are not equal. Likewise, the two spans [3, 3〉 and [5, 5〉 are not equal, even though w _[3,3〉 = w _[5,5〉 = ε. The whole word w is described by the span [1, 10〉.
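The span arithmetic of this example can be reproduced mechanically. The following sketch assumes that a span [i, j〉 is stored as a Python pair `(i, j)` with 1-based positions; the names `spans` and `subword` are hypothetical helpers, not notation from the paper.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (i, j) stands for the span [i, j〉, 1-based


def spans(w: str) -> List[Span]:
    """All spans [i, j〉 of w, i.e. all pairs with 1 <= i <= j <= |w| + 1."""
    n = len(w)
    return [(i, j) for i in range(1, n + 2) for j in range(i, n + 2)]


def subword(w: str, span: Span) -> str:
    """The subword a_i ... a_{j-1} described by the span [i, j〉."""
    i, j = span
    return w[i - 1 : j - 1]


if __name__ == "__main__":
    w = "aabbcabaa"
    print((1, 3) in spans(w), (10, 11) in spans(w))
    print(subword(w, (1, 3)), subword(w, (8, 10)))
    print(repr(subword(w, (3, 3))), repr(subword(w, (5, 5))))
```

The pairs (1, 3) and (8, 10) compare as different even though they describe the same subword, which is the distinction between span equality and string equality used later for join and selection.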

Definition 2.6

Let SVars be a fixed, infinite set of span variables, where Σ and SVars are disjoint. Let V ⊂ SVars be a finite subset of SVars, and let w ∈ Σ ^∗. A (V, w)-tuple is a function μ: V → Spans(w) that maps each variable in V to a span of w. If context allows, we write w-tuple instead of (V, w)-tuple. A set of (V, w)-tuples is called a (V, w)-relation.

As V and Spans(w) are finite, every (V, w)-relation is finite by definition. Our next step is the definition of document spanners, which map words w to (V, w)-relations:

Definition 2.7

Let V and Σ be alphabets of variables and symbols, respectively. A (document) spanner is a function P that maps every word w ∈ Σ ^∗ to a (V, w)-relation P(w). Let V be denoted by SVars(P). A spanner P is n-ary if |SVars(P)| = n, and Boolean if SVars(P) = ∅. For all w ∈ Σ ^∗, we say P(w) = True and P(w) = False instead of P(w) = {()} and P(w) = ∅, respectively.

A w-tuple μ ∈ P(w) is hierarchical if for all x, y ∈ SVars(P) at least one of the following holds: (1) The span μ(x) contains μ(y), (2) the span μ(y) contains μ(x), or (3) the spans μ(x) and μ(y) are disjoint. A spanner P is hierarchical if, for every w ∈ Σ ^∗, every μ ∈ P(w) is hierarchical.
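The hierarchy condition is a pairwise check on the spans of a tuple. Here is a minimal sketch, assuming that a w-tuple is stored as a Python dictionary from variable names to pairs `(i, j)`; the function names are hypothetical, and `contains` and `overlap` follow the definitions of Section 2.2 literally.

```python
from itertools import combinations
from typing import Dict, Tuple

Span = Tuple[int, int]  # (i, j) stands for the span [i, j〉


def contains(s: Span, t: Span) -> bool:
    """[i, j〉 contains [i', j'〉 if i <= i' <= j' <= j."""
    return s[0] <= t[0] <= t[1] <= s[1]


def overlap(s: Span, t: Span) -> bool:
    """The spans overlap if i <= i' < j or i' <= i < j'."""
    (i, j), (i2, j2) = s, t
    return i <= i2 < j or i2 <= i < j2


def is_hierarchical(mu: Dict[str, Span]) -> bool:
    """Every pair of spans is nested in one direction or disjoint."""
    return all(
        contains(s, t) or contains(t, s) or not overlap(s, t)
        for s, t in combinations(mu.values(), 2)
    )


if __name__ == "__main__":
    print(is_hierarchical({"x": (1, 4), "y": (2, 3)}))  # nested spans
    print(is_hierarchical({"x": (1, 3), "y": (2, 4)}))  # properly overlapping
```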

A spanner P is total on w if P(w) contains all w-tuples over SVars(P). Let Y ⊂ SVars be a finite set of variables. The universal spanner over Y is denoted by Υ_Y. It is the unique spanner P′ such that SVars(P′) = Y and P′ is total on every w ∈ Σ ^∗. Furthermore, a spanner P is hierarchical total on w if P(w) is exactly the set of all hierarchical w-tuples over SVars(P); and the universal hierarchical spanner over a set Y is the unique spanner ${\Upsilon }^{\mathbf {H}}_{Y}$ that is hierarchical total on every w ∈ Σ ^∗.

For two spanners P ₁ and P ₂, we write P ₁ ⊆ P ₂ if P ₁(w) ⊆ P ₂(w) for every w ∈ Σ ^∗, and P ₁ = P ₂ if P ₁(w) = P ₂(w) for every w ∈ Σ ^∗.

Hence, a spanner can be understood as a function that maps a word w to a set of functions, each of which assigns spans of w to the variables of the spanner. As Boolean spanners are functions that map words to truth values, they can be interpreted as characteristic functions of languages. For every Boolean spanner P, we define the language recognized by P as $\mathcal {L}(P):=\{w\in \Sigma ^{*}\mid P(w)=\texttt {True}\}$. We extend this to arbitrary spanners P by $\mathcal {L}(P):=\{w\in \Sigma ^{*}\mid P(w)\neq \emptyset \}$.

Definition 2.8

A regex formula is an xregex α over Σ and X := SVars such that α does not contain any variable references, and for every β ∈ Sub(α) with β = γ ^∗, no subexpression of γ may be a variable binding.

In other words, a regex formula is a proper regular expression that is extended with variable binding operators, but these operators may not occur inside a Kleene star. We define SVars(γ) := Vars(γ) for all regex formulas γ.

To define the semantics of regex formulas, we use the definition of parse trees for xregexes, see Definition 2.2. Intuitively, the goal of this definition is that each occurrence of a variable x in a γ-parse tree is matched to the corresponding span. Here, two problems can arise. Firstly, a variable might not occur in the parse tree; for example, when matching the regex formula (x{a} ∨ bb) to the word bb. Secondly, a variable might be defined too often, as e. g. in the regex formula x{Σ ⁺} ⋅ x{Σ ⁺}. In order to avoid such problems, we introduce the notion of a functional regex formula.

Definition 2.9

Let γ be a regex formula. We call γ functional if for every w ∈ Σ ^∗, every γ-parse tree T _γ for w, and each variable x ∈ SVars(γ), exactly one node of T _γ has a label of the form (v, x{β}), where v is a subword of w and β is a sub-regex formula of γ. The class of all functional regex formulas is denoted by RGX.

As shown in Proposition 3.5 in Fagin et al. [7], functionality has a straightforward syntactic characterization: Basically, variables may not be redeclared, variables may not be used inside of Kleene stars, and if variables are used in a disjunction, each side of a disjunction has to bind exactly the same variables. Consider the following example:

Example 2.10

The regex formula γ ₁ := (x{a} ∨ x{b}) is functional even though it contains two occurrences of variable definitions for x. There are just two γ ₁-parse trees, both of which only contain one node labeled (c, x{c}), where c ∈ {a, b}. As a trivial case, even γ ₂ := x{∅} is functional (as no γ ₂-parse tree exists). Furthermore, the regex formulas γ ₃ := x{(a ∨ b)^∗} ⋅ x{b ⁺} and γ ₄ := a ^∗ ∨ x{b} are not functional. Finally, γ ₅ := x{a}^∗ is not a regex formula at all.

For functional regex formulas, we use parse trees to define the semantics:

Definition 2.11

Let γ be a functional regex formula and let T be a γ-parse tree for a word w ∈ Σ ^∗. For every node v of T, the subtree that is rooted at v naturally maps to a span p(v) of w. As γ is functional, for every x ∈ SVars(γ), exactly one node v _x of T has a label that contains x. We define μ ^T: SVars(γ) → Spans(w) by μ ^T(x) := p(v _x). Each γ ∈ RGX defines a spanner [[γ]] by

$${\left[{\kern-2.3pt}[ \gamma \right]{\kern-2.3pt}]}(w):= \{\mu^{T}\mid \text{\textit{T} is a $\gamma$-parse tree for \textit{w}}\}$$

for each w ∈ Σ ^∗.

Example 2.12

Assume that a, b ∈ Σ. We define the regex formula

$$\alpha:= \Sigma^{*} \cdot x\{\mathtt{a}\cdot y\{\Sigma^{*}\} \cdot (z\{\mathtt{a}\}\mathbin{\vee} z\{\mathtt{b}\})\}\cdot\Sigma^{*}.$$

Let w := baaba. Then [[α]](w) consists of ([2, 4〉, [3, 3〉, [3, 4〉), ([2, 5〉, [3, 4〉, [4, 5〉), ([2, 6〉, [3, 5〉, [5, 6〉), ([3, 5〉, [4, 4〉, [4, 5〉), and ([3, 6〉, [4, 5〉, [5, 6〉).
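The five tuples can be recomputed by brute force. The sketch below is not a general evaluator for regex formulas: it is specialised by hand to the shape of α, under the assumption Σ = {a, b}. A tuple is written as a triple of pairs in the order (x, y, z), and the function name `eval_alpha` is hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (i, j) stands for the span [i, j〉


def eval_alpha(w: str) -> Set[Tuple[Span, Span, Span]]:
    """All (x, y, z) tuples of the example formula on w, assuming Σ = {a, b}."""
    n = len(w)
    result = set()
    for i in range(1, n + 1):
        # x must start with the terminal a at position i.
        if w[i - 1] != "a":
            continue
        # x = [i, j〉 needs at least two letters: the leading a and the letter for z.
        for j in range(i + 2, n + 2):
            # z is bound to the last letter of x, which must be a or b.
            if w[j - 2] in ("a", "b"):
                x, y, z = (i, j), (i + 1, j - 1), (j - 1, j)
                result.add((x, y, z))
    return result


if __name__ == "__main__":
    for mu in sorted(eval_alpha("baaba")):
        print(mu)
```

Each choice of a start position i with letter a and an end position j fixes y and z completely, which is why the relation has one tuple per admissible pair (i, j).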

For every w ∈ Σ ^∗, a spanner P defines a (V, w)-relation P(w). In order to construct more sophisticated spanners, we introduce spanner operators.

Definition 2.13

Let P, P ₁, P ₂ be spanners and let w ∈ Σ ^∗. The algebraic operators union, projection, natural join and selection are defined as follows.

Union: Two spanners P ₁ and P ₂ are union compatible if SVars(P ₁) = SVars(P ₂), and their union (P ₁ ∪ P ₂) is defined by SVars(P ₁ ∪ P ₂) := SVars(P ₁) = SVars(P ₂) and (P ₁ ∪ P ₂)(w) := P ₁(w) ∪ P ₂(w) for every w ∈ Σ ^∗.
Projection: Let Y ⊆ SVars(P). The projection π _Y P is defined by SVars(π _Y P) := Y and π _Y P(w) := P(w)|_Y for all w ∈ Σ ^∗, where P(w)|_Y is the restriction of all w-tuples in P(w) to Y .
Natural join: Let V _i := SVars(P _i) for i ∈ {1, 2}. The (natural) join (P ₁ ⋈ P ₂) of P ₁ and P ₂ is defined by SVars(P ₁⋈P ₂) := SVars(P ₁) ∪ SVars(P ₂) and, for all w ∈ Σ ^∗, we define (P ₁ ⋈ P ₂)(w) as the set of all (V ₁ ∪ V ₂, w)-tuples μ for which there exist tuples μ ₁ ∈ P ₁(w) and μ ₂ ∈ P ₂(w) with $\mu |_{V_{1}} = \mu _{1}$ and $\mu |_{V_{2}} = \mu _{2}$.
Selection: Let R ⊆ (Σ ^∗)^k be a k-ary relation over Σ ^∗. The selection operator ζ ^R is parameterized by k variables x ₁, …, x _k ∈ SVars(P), written as $\zeta ^{R}_{x_{1},\dots ,x_{k}}$. The selection $\zeta ^{R}_{x_{1},\dots ,x_{k}} P$ is defined by $\textsf{SVars}(\zeta ^{R}_{x_{1},\dots ,x_{k}} P) := {\textsf{SVars}\left (P\right )}$ and, for all w ∈ Σ ^∗, we define $\zeta ^{R}_{x_{1},\dots ,x_{k}} P(w)$ as the set of all μ ∈ P(w) for which $\left (w_{\mu (x_{1})}, \dots , w_{\mu (x_{k})}\right ) \in R$.

Like [7], we mostly consider the string equality selection operator ζ ⁼. Hence, unless otherwise noted, the term “selection” refers to selection by the n-ary string equality relation. Note that unlike selection (which compares strings), join requires that the spans are identical.

The join P ₁ ⋈ P ₂ of two spanners P ₁ and P ₂ is equivalent to the intersection P ₁ ∩ P ₂ if SVars(P ₁) = SVars(P ₂), and to the Cartesian Product P ₁ × P ₂ if SVars(P ₁) and SVars(P ₂) are disjoint. Hence, if applicable, we write ∩ and × instead of ⋈.
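The four operators can be illustrated on the level of (V, w)-relations for one fixed word w. This is a minimal sketch under assumptions of our own: a w-tuple is stored as a frozenset of (variable, span) pairs, a relation is a Python set of such tuples, and the names `mu`, `union`, `project`, `join`, and `select_eq` are hypothetical. It does not model spanners as functions on all words.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Span = Tuple[int, int]                 # (i, j) stands for the span [i, j〉
Mu = FrozenSet[Tuple[str, Span]]       # a w-tuple as (variable, span) pairs
Relation = Set[Mu]


def mu(**assignment: Span) -> Mu:
    """Build a w-tuple, e.g. mu(x=(1, 3), y=(3, 5))."""
    return frozenset(assignment.items())


def union(r1: Relation, r2: Relation) -> Relation:
    """Union of two relations over the same variables."""
    return r1 | r2


def project(r: Relation, ys: Iterable[str]) -> Relation:
    """Restrict every tuple to the variables in ys."""
    keep = set(ys)
    return {frozenset((v, s) for v, s in m if v in keep) for m in r}


def join(r1: Relation, r2: Relation) -> Relation:
    """Natural join: combine tuples that agree on the spans of shared variables."""
    out = set()
    for m1 in r1:
        d1 = dict(m1)
        for m2 in r2:
            if all(d1.get(v, s) == s for v, s in m2):
                out.add(m1 | m2)
    return out


def select_eq(w: str, r: Relation, xs: Iterable[str]) -> Relation:
    """String equality selection: the variables in xs must describe equal subwords."""
    names = list(xs)

    def word(m: Mu, x: str) -> str:
        i, j = dict(m)[x]
        return w[i - 1 : j - 1]

    return {m for m in r if len({word(m, x) for x in names}) <= 1}
```

With identical variable sets, `join` returns exactly the set intersection, and with disjoint variable sets it returns all combinations, matching the two special cases stated above. Note that `join` compares spans, while `select_eq` compares the subwords that the spans describe.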

For convenience, we may add and omit parentheses. We assume there is an order of precedence with projection and selection ranking over join ranking over union, e.g. we may write $\pi _{Y} \zeta ^=_{x,y} P_{1} \cup P_{2} \bowtie P_{3}$ instead of $(\pi _{Y} \zeta ^=_{x,y} P_{1} \cup (P_{2} \bowtie P_{3}))$, where projection and selection are applied to P ₁, and the result is united with the join of P ₂ and P ₃.

Example 2.14

Let $P_{1}:= \zeta ^=_{x,y} {\left [{\kern -2.3pt}[ x\{\Sigma ^{*}\} y\{\Sigma ^{*}\} \right ]{\kern -2.3pt}]}$ and $P_{2}:= \zeta ^=_{x,y,z}{\left [{\kern -2.3pt}[x\{\Sigma ^{*}\} y\{\Sigma ^{*}\} z\{\Sigma ^{*}\} \right ]{\kern -2.3pt}]}$ . Then $\mathcal {L}(P_{1})=\{ww\mid w\in \Sigma ^{*}\}$ , and the variables x and y refer to the span of the first and second occurrence of w, respectively. Analogously, $\mathcal {L}(P_{2})=\{w^{3}\mid w\in \Sigma ^{*}\}$ (and z refers to the third occurrence of w). Assume that we want to construct a spanner for the language {w ⁿ∣w ∈ Σ ^∗, n ∈ {2, 3}}. As P ₁ and P ₂ are not union compatible, we cannot simply define P ₁ ∪ P ₂. Union compatibility can be achieved by projecting P ₂ onto the set of common variables (i. e., π _{x,y} P ₂).
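The spanner P ₁ ∪ π _{x,y} P ₂ from this example can be evaluated by enumerating split points. The sketch below is specialised by hand to these two spanners; tuples are written as pairs of spans in the order (x, y), and the function names `p1`, `p2_projected`, and `squares_or_cubes` are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (i, j) stands for the span [i, j〉


def p1(w: str) -> Set[Tuple[Span, Span]]:
    """P1(w): split w into x and y and keep splits with equal subwords."""
    n = len(w)
    return {
        ((1, k), (k, n + 1))
        for k in range(1, n + 2)
        if w[: k - 1] == w[k - 1 :]
    }


def p2_projected(w: str) -> Set[Tuple[Span, Span]]:
    """(π_{x,y} P2)(w): split w into x, y, z with equal subwords, then drop z."""
    n = len(w)
    out = set()
    for k in range(1, n + 2):
        for l in range(k, n + 2):
            if w[: k - 1] == w[k - 1 : l - 1] == w[l - 1 :]:
                out.add(((1, k), (k, l)))
    return out


def squares_or_cubes(w: str) -> Set[Tuple[Span, Span]]:
    """(P1 ∪ π_{x,y} P2)(w); w is in the language iff the result is non-empty."""
    return p1(w) | p2_projected(w)


if __name__ == "__main__":
    for word in ["abab", "ababab", "aba", "aaaaaa"]:
        print(word, sorted(squares_or_cubes(word)))
```

After the projection both relations range over the variables x and y, so the set union is well defined. A word such as aaaaaa is both a square and a cube and therefore contributes one tuple from each operand.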

Definition 2.15

A spanner algebra is a finite set of spanner operators. If O is a spanner algebra, then RGX ^O denotes the set of all spanner representations that can be constructed by (repeated) combination of the symbols for the operators from O with regex formulas from RGX. For each operator o ∈ O and each spanner representation of the form o ρ (if o is unary) or ρ ₁ o ρ ₂ (if o is binary), we define [[o ρ]] := o[[ρ]] or [[ρ ₁ o ρ ₂]] := [[ρ ₁]] o [[ρ ₂]], respectively. Furthermore, [[RGX ^O]] is the closure of [[RGX]] under the spanner operators in O.

We define $\mathcal {L}(\rho ):=\mathcal {L}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])$ for every spanner representation ρ. Fagin et al. [7] refer to [[RGX]] as the class of hierarchical regular spanners and to [[RGX ^{{π, ∪, ⋈}}]] as the class of regular spanners. In addition to (hierarchical) regular spanners, Fagin et al. also introduced the so-called core spanners, which are obtained by combining regex formulas with the four algebraic operators projection, selection, union, and join – in other words, the class of core spanners is the class $[{\kern -2.3pt}[\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}]{\kern -2.3pt}]$. Analogously, $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ is the class of core spanner representations.

3 Expressibility Results

3.1 Pattern Languages

We begin our examination of the expressive power of core spanners by comparing them to one of the simplest mechanisms with repetition operators:

Definition 3.1

Let X be an infinite variable alphabet that is disjoint from Σ. A pattern is a word α ∈ (Σ ∪ X)⁺ that generates the language

$$\mathcal{L}(\alpha):=\{\sigma(\alpha)\mid \text{$\sigma$ is a pattern substitution}\}, $$

where a pattern substitution is a homomorphism σ : (Σ ∪ X)^∗ → Σ ^∗ with σ (a) = a for all a ∈ Σ. We denote the set of all variables in α by Vars(α).

Intuitively, a pattern α generates exactly those words that can be obtained by replacing the variables in α with terminal words homomorphically (i. e., multiple occurrences of the same variable have to be replaced in the same way). This type of pattern languages is also called erasing pattern language (cf. Jiang et al. [24]).

Example 3.2

Let x, y ∈ X and a, b ∈ Σ. The patterns α := x x and β := x a y b x generate the languages $\mathcal {L}(\alpha )=\{ww\mid w\in \Sigma ^{*}\}$ and $\mathcal {L}(\beta )=\{v \mathtt {a} w \mathtt {b} v\mid v,w\in \Sigma ^{*}\}.$
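Membership in an erasing pattern language can be decided by searching for a pattern substitution. The following is a simple backtracking sketch, assuming that a pattern is given as a list of one-letter symbols together with the set of symbols that count as variables; the function name `match_pattern` is hypothetical. As the membership problem is NP-complete, the exponential worst-case running time of such a search is to be expected.

```python
from typing import Dict, Optional, Sequence, Set


def match_pattern(
    pattern: Sequence[str], variables: Set[str], w: str
) -> Optional[Dict[str, str]]:
    """Return a substitution sigma with sigma(pattern) = w, or None if none exists."""

    def go(k: int, pos: int, sigma: Dict[str, str]) -> Optional[Dict[str, str]]:
        if k == len(pattern):
            return dict(sigma) if pos == len(w) else None
        sym = pattern[k]
        if sym not in variables:
            # Terminals are mapped to themselves.
            if w.startswith(sym, pos):
                return go(k + 1, pos + len(sym), sigma)
            return None
        if sym in sigma:
            # A repeated variable must be replaced by the same word again.
            image = sigma[sym]
            if w.startswith(image, pos):
                return go(k + 1, pos + len(image), sigma)
            return None
        # First occurrence of a variable: try every image, including the empty word.
        for end in range(pos, len(w) + 1):
            sigma[sym] = w[pos:end]
            found = go(k + 1, end, sigma)
            if found is not None:
                return found
            del sigma[sym]
        return None

    return go(0, 0, {})


if __name__ == "__main__":
    print(match_pattern(["x", "x"], {"x"}, "abab"))
    print(match_pattern(list("xaybx"), {"x", "y"}, "babbb"))
```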

From every pattern α, we can straightforwardly construct an xregex for $\mathcal {L}(\alpha )$. A similar observation holds for core spanners:

Theorem 3.3

There is an algorithm that, given a pattern α, computes in polynomial time $\rho _{\alpha }\in \textsf{RGX}^{\{\zeta ^=\}}$ such that $\mathcal {L}(\rho _{\alpha })=\mathcal {L}(\alpha )$.

Proof

Let α = α ₁⋯α _n with $n\in \mathbb {N}_{>0}$ and α ₁, …, α _n ∈ (Σ ∪ X). We rewrite α into a regex formula $\hat {\alpha }$, by replacing the i-th occurrence of a variable x with a binding x _i{Σ ^∗}. More formally, we define $\hat {\alpha }:= \hat {\alpha }_{1}{\cdots } \hat {\alpha }_{n}$, where for each i ∈ {1, …, n}, the regex formula $\hat {\alpha }_{i}$ is defined as follows:

1.
If α _i is a terminal (i. e., there is an a ∈ Σ with α _i = a), let $\hat {\alpha }_{i}:= a$.
2.
If α _i is the j-th occurrence of a variable x ∈ X in α, let $\hat {\alpha }_{i}:= x_{j}\{\Sigma ^{*}\}$.

Hence, no variable occurs twice in $\hat {\alpha }$; and as $\hat {\alpha }$ contains no disjunctions on variables, $\hat {\alpha }$ is functional.

We now define S to be a sequence of selections, where S contains exactly the selections $\zeta ^=_{x_{1},\ldots ,x_{k}}$ for each x ∈ Vars(α) with |α|_x = k and k ≥ 2. In other words, for each x that occurs more than once in α, we include a selection of all x _i.

Finally, we define $\rho _{\alpha }:= S \hat {\alpha }$. It is easy to see that $\mathcal {L}(\rho _{\alpha })=\mathcal {L}(\alpha )$: For every $w\in \mathcal {L}(\alpha )$, we can use a pattern substitution σ with σ (α) = w to construct a corresponding w-tuple μ for ρ _α. Likewise, for every $w\in \mathcal {L}(\rho _{\alpha })$, there exists a corresponding w-tuple μ from which we can reconstruct a pattern substitution σ with σ (α) = w: By the construction of ρ _α, for each pair of variables x _i, x _j in $\hat {\alpha }$, the words $w_{\mu (x_{i})}$ and $w_{\mu (x_{j})}$ must be identical. This allows us to define $\sigma (x):= w_{\mu (x_{1})}$. □

Example 3.4

Let x, y, z ∈ X, a, b ∈ Σ, and define the pattern α := x a y y b x z x. The construction in the proof of Theorem 3.3 leads to the spanner representation $\zeta ^=_{x_{1},x_{2},x_{3}}\zeta ^=_{y_{1},y_{2}} \gamma $, where γ = x ₁{Σ ^∗}⋅a⋅y ₁{Σ ^∗}⋅y ₂{Σ ^∗}⋅b⋅x ₂{Σ ^∗}⋅z ₁{Σ ^∗}⋅x ₃{Σ ^∗}.
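The construction from the proof of Theorem 3.3 is a single left-to-right pass over the pattern. The sketch below follows that construction, under the assumption that a pattern is given as a list of one-letter symbols plus the set of variables; the output is a purely textual rendering (a list of selection variable groups and a formula string), and the function name `pattern_to_spanner` is hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple


def pattern_to_spanner(
    pattern: Sequence[str], variables: Set[str]
) -> Tuple[List[List[str]], str]:
    """Return (selections, regex formula) for a pattern, written as text."""
    occurrences: Counter = Counter()
    parts = []
    for sym in pattern:
        if sym in variables:
            # The j-th occurrence of x becomes the binding x_j{Σ*}.
            occurrences[sym] += 1
            parts.append(f"{sym}{occurrences[sym]}{{Σ*}}")
        else:
            # Terminals are copied unchanged.
            parts.append(sym)
    # One string equality selection per variable that occurs at least twice.
    selections = [
        [f"{x}{i}" for i in range(1, k + 1)]
        for x, k in occurrences.items()
        if k >= 2
    ]
    return selections, "·".join(parts)


if __name__ == "__main__":
    sels, gamma = pattern_to_spanner(list("xayybxzx"), {"x", "y", "z"})
    print(sels)
    print(gamma)
```

The pass touches every symbol once and emits at most one selection per variable, which is consistent with the polynomial time bound stated in the theorem.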

While the construction in the proof of Theorem 3.3 is so simple that it might not seem noteworthy, it will prove quite useful: In contrast to the simple definition of pattern languages, many canonical decision problems for them are surprisingly hard. Via Theorem 3.3, the corresponding lower bounds also apply to spanners, as we discuss in Sections 4.1 and 4.2.

3.2 Word Equations and Existential Concatenation Formulas

In this section, we introduce word equations, which are equations of patterns (cf. Definition 3.1) and can be used to define languages and relations, cf. Karhumäki et al. [26]:

Definition 3.5

A word equation is a pair η := (η _L, η _R) of patterns η _L and η _R. A pattern substitution σ is a solution of η if σ (η _L) = σ (η _R). We define Vars(η) := Vars(η _L) ∪ Vars(η _R). For k ≥ 1, a relation R ⊆ (Σ ^∗)^k is defined by a word equation η := (η _L, η _R) if there exist variables x ₁, …, x _k ∈ Vars(η) such that $R=\left \{\left (\sigma (x_{1}),\ldots ,\sigma (x_{k})\right )\mid \text {$\sigma $ is a solution of $\eta $}\right \}$.

We also write (η _L, η _R) as η _L = η _R. As we shall see just after the next definition, both sides of the equation may have common variables. The following relations are well-known examples of relations that are definable by word equations:

Definition 3.6

Over Σ ^∗, we define relations

$$\begin{array}{@{}rcl@{}} R_{\text{com}}&:=&\{(x,y)\mid x,y\in\{u\}^{*} \text{ for some } u\in\Sigma^{*}\},\\ R_{\text{cyc}}&:=&\{(x,y)\mid x \text{ is a cyclic permutation of } y\}. \end{array} $$

As shown in Lothaire [30], the relation R _com is defined by the equation x y = y x, and R _cyc is defined by the equation x z = z y.
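Both characterizations can be checked by brute force on short words. The following Python sketch is only a sanity check under stated assumptions, not a proof: the helper names are hypothetical, and the alphabet and length bound are chosen arbitrarily by the caller.

```python
from itertools import product

def words(alphabet, max_len):
    """Yield all words over alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def is_power_of(w, u):
    """True if w is in {u}^*."""
    if w == "":
        return True
    return u != "" and len(w) % len(u) == 0 and u * (len(w) // len(u)) == w

def in_r_com(x, y):
    """(x, y) in R_com: x and y are powers of a common word u."""
    base = x if x else y
    # A suitable u can always be chosen as a prefix of a non-empty component
    # (and as the empty word if x and y are both empty).
    return any(
        is_power_of(x, base[:k]) and is_power_of(y, base[:k])
        for k in range(len(base) + 1)
    )

def in_r_cyc(x, y):
    """(x, y) in R_cyc: x = uv and y = vu for some words u, v."""
    return any(x[k:] + x[:k] == y for k in range(len(x) + 1))
```

On all pairs of words over {a, b} of length at most 4, `in_r_com(x, y)` agrees with the test `x + y == y + x`, and `in_r_cyc(x, y)` agrees with the existence of a short z satisfying `x + z == z + y`.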

Let R be a k-ary string relation, and let C be a class of spanners. We say that R is selectable by C, if for every spanner P ∈ C and every sequence of variables x = (x ₁, …, x _k) with x ₁, …, x _k ∈ SVars(P), the spanner $\zeta ^{R}_{\mathbf {x}} P$ is also in C.

Proposition 3.7

The relations R _com and R _cyc are selectable by core spanners.

Proof

Both parts of the proof use a technique from [7]. Let x = x ₁, ..., x _k be a sequence of distinct span variables (k ≥ 1), and let X := {x ₁, …, x _k}. The spanner $\zeta ^{R}_{\mathbf {x}} {\Upsilon }_{X}$ is called the R-restricted universal spanner over x, and is denoted by ${\Upsilon }^{R}_{\mathbf {x}}$. According to Proposition 4.15 in [7], in order to show that a relation R is selectable by core spanners, it suffices to show that ${\Upsilon }^{R}_{\mathbf {x}}$ is a core spanner for every x ∈ SVars ^k.

R _cyc: Note that for all x, y ∈ Σ ^∗, the word x is a cyclic permutation of y (and vice versa) if and only if there exist u, v ∈ Σ ^∗ with x = u v and y = v u (see e. g. Lothaire [30]). Hence we can define the core spanner $P_{\text {cyc}}:= \pi _{\{x,y\}} \hat {P}$, where

$$\hat{P}:=\zeta^=_{u_{1},u_{2}}\zeta^=_{v_{1},v_{2}} [{\kern-2.3pt}[ \alpha_{x}\times \alpha_{y} ]{\kern-2.3pt}], $$

and the regex formulas α _x and α _y are defined as

$$\begin{array}{@{}rcl@{}} \alpha_{x} &:=& \Sigma^{*} x\left\{u_{1}\{\Sigma^{*}\}\cdot v_{1}\{\Sigma^{*}\}\right\}\Sigma^{*},\\ \alpha_{y} &:=& \Sigma^{*} y\left\{v_{2}\{\Sigma^{*}\}\cdot u_{2}\{\Sigma^{*}\}\right\}\Sigma^{*}. \end{array} $$

In order to prove that $P_{\text {cyc}}={\Upsilon }^{R_{\text {cyc}}}_{x,y}$, we first observe that, for every w ∈ Σ ^∗ and every μ ∈ P _cyc(w), there exists a $\hat {\mu }\in \hat {P}(w)$ with $\mu (x)=\hat {\mu }(x)$ and $\mu (y)=\hat {\mu }(y)$. The selections enforce $u:= w_{\hat {\mu }(u_{1})}=w_{\hat {\mu }(u_{2})}$ and $v:= w_{\hat {\mu }(v_{1})}=w_{\hat {\mu }(v_{2})}$. Hence, w _μ(x) = u v and w _μ(y) = v u, which means that (w _μ(x), w _μ(y)) ∈ R _cyc, and $\mu \in {\Upsilon }^{R_{\text {cyc}}}_{x,y}(w)$. For the other direction, we can show analogously that every $\mu \in {\Upsilon }^{R_{\text {cyc}}}_{x,y}(w)$ can be extended into a $\hat {\mu }\in \hat {P}(w)$, which then proves μ ∈ P _cyc(w).
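The equality $P_{\text {cyc}}={\Upsilon }^{R_{\text {cyc}}}_{x,y}$ can also be checked experimentally on short inputs. The sketch below evaluates both sides by brute force. The representation of spans as 1-based pairs (i, j) and all function names are assumptions of this sketch, and the enumeration is exponentially far from an efficient evaluation algorithm.

```python
from itertools import product

def spans(w):
    """All spans [i, j> of w with 1 <= i <= j <= |w| + 1."""
    n = len(w)
    return [(i, j) for i in range(1, n + 2) for j in range(i, n + 2)]

def content(w, span):
    """The subword w_[i,j> selected by a span."""
    i, j = span
    return w[i - 1:j - 1]

def p_cyc(w):
    """Pairs (mu(x), mu(y)) produced by the construction of P_cyc.

    x is split into u1 v1 and y into v2 u2 (regex formulas alpha_x, alpha_y);
    the two equality selections require w_u1 = w_u2 and w_v1 = w_v2.
    """
    result = set()
    for sx, sy in product(spans(w), repeat=2):
        (ix, jx), (iy, jy) = sx, sy
        for kx in range(ix, jx + 1):        # u1 = [ix, kx>, v1 = [kx, jx>
            for ky in range(iy, jy + 1):    # v2 = [iy, ky>, u2 = [ky, jy>
                u1, v1 = content(w, (ix, kx)), content(w, (kx, jx))
                v2, u2 = content(w, (iy, ky)), content(w, (ky, jy))
                if u1 == u2 and v1 == v2:
                    result.add((sx, sy))
    return result

def upsilon_cyc(w):
    """The R_cyc-restricted universal spanner, evaluated directly."""
    def cyclic(a, b):
        return any(a[k:] + a[:k] == b for k in range(len(a) + 1))
    return {
        (sx, sy)
        for sx, sy in product(spans(w), repeat=2)
        if cyclic(content(w, sx), content(w, sy))
    }
```

For w = aba, for example, both functions contain the pair ((1, 3), (2, 4)), which selects the subwords ab and ba.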

R _com: This proof relies on another fact from combinatorics on words. For all x, y ∈ Σ ^∗, the equation x y = y x holds if and only if (x, y) ∈ R _com (again, see Lothaire [30]). We define a core spanner $P_{\text {com}}:= \pi _{\{x,y\}}\hat {P}$, where

$$\hat{P}:=\zeta^=_{r_{1},r_{2},r_{3},r_{4}}\zeta^=_{x,x_{2}}\zeta^=_{y,y_{2}}\zeta^=_{\hat{x},\hat{x}_{2}}\zeta^=_{\hat{y},\hat{y}_{2}} [{\kern-2.3pt}[ \alpha_{1}\times \alpha_{2} \times \alpha_{3}\times \alpha_{4} ]{\kern-2.3pt}], $$

and the regex formulas α ₁, …, α ₄ are defined as

$$\begin{array}{@{}rcl@{}} \alpha_{1} &:=& \Sigma^{*} x\left\{\hat{x}\{\Sigma^{*}\}\cdot r_{1}\{\Sigma^{*}\}\right\}\Sigma^{*},\\ \alpha_{2} &:=& \Sigma^{*} x_{2}\left\{r_{2}\{\Sigma^{*}\}\cdot \hat{x}_{2}\{\Sigma^{*}\}\right\}\Sigma^{*},\\ \alpha_{3} &:=& \Sigma^{*} y\left\{\hat{y}\{\Sigma^{*}\}\cdot r_{3}\{\Sigma^{*}\}\right\}\Sigma^{*},\\ \alpha_{4} &:=& \Sigma^{*} y_{2}\left\{r_{4}\{\Sigma^{*}\}\cdot \hat{y}_{2}\{\Sigma^{*}\}\right\}\Sigma^{*}. \end{array} $$

In order to prove that $P_{\text {com}}={\Upsilon }^{R_{\text {com}}}_{x,y}$, first assume that μ ∈ P _com(w) for some w ∈ Σ ^∗. Again, this means that there exists a $\hat {\mu }\in \hat {P}(w)$ with $\mu (x)=\hat {\mu }(x)$ and $\mu (y)=\hat {\mu }(y)$. In a slight abuse of notation, we identify the variables $x,\hat {x},y,\hat {y}$ with the corresponding subwords of w. In other words, we define $x,\hat {x},y,\hat {y}\in \Sigma ^{*}$ by $z := w_{\hat {\mu }(z)}$ for $z\in \{x,\hat {x},y,\hat {y}\}$. Furthermore, let $r=w_{\hat {\mu }(r_{1})}$. Due to the equality selections, we obtain the following word equations from α ₁ to α ₄:

$$\begin{array}{@{}rcl@{}} x &=& \hat{x} r = r \hat{x},\\ y &=& \hat{y} r = r \hat{y}. \end{array} $$

We explain this in detail for the first equation: First, note that due to the structure of α ₁, we know that $w_{\mu (x)} = w_{\mu (\hat {x})}\cdot w_{\mu (r_{1})}$ holds. Likewise, the structure of α ₂ ensures that $w_{\mu (x_{2})} = w_{\mu (r_{2})}\cdot w_{\mu (\hat {x}_{2})}.$ Due to the selections $\zeta ^=_{r_{1},r_{2},r_{3},r_{4}}$, $\zeta ^=_{x,x_{2}}$, and $\zeta ^=_{\hat {x},\hat {x}_{2}}$, the latter can be expressed as $w_{\mu (x)} = w_{\mu (r_{1})}\cdot w_{\mu (\hat {x})},$ and by combining the two equations while abusing the notation as explained above, we obtain $x=\hat {x}r=r\hat {x}$. The second equation is obtained analogously.

As $\hat {x}r = r \hat {x}$, there exists a word u ∈ Σ ^∗ with $r,\hat {x}\in \{u\}^{*}$. We choose the shortest u for which r ∈ {u}^∗. Then, due to $\hat {y} r = r \hat {y}$, we have that $\hat {y}\in \{u\}^{*}$ holds as well. This implies x, y ∈ {u}^∗, (w _μ(x), w _μ(y)) ∈ R _com, and $\mu \in {\Upsilon }^{R_{\text {com}}}_{x,y}(w)$. Again we can show analogously that every $\mu \in {\Upsilon }^{R_{\text {com}}}_{x,y}(w)$ can be extended into a $\hat {\mu }\in \hat {P}(w)$, which then proves μ ∈ P _com(w). □

In particular, this means that we can add $\zeta ^{R_{\text {com}}}$ and $\zeta ^{R_{\text {cyc}}}$ to core spanner representations, without leaving the class $[{\kern -2.3pt}[\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}]{\kern -2.3pt}]$.

Example 3.8

Define L _imp := {w ⁿ∣w ∈ Σ⁺, n ≥ 2} and ρ := ${\zeta}^{R_{\text {com}}}_{x,y}$ (x{Σ⁺} ⋅ y{Σ⁺}). Then $\mathcal {L}(\rho )=L_{\text {imp}}$.

This does not imply that R _com can be used to select relations like R _pow := {(x, x ⁿ)∣n ≥ 0}. For example, if x := abab, then (x, y) ∈ R _com holds for all y ∈ {ab}^∗. The authors conjecture that R _pow is not selectable by core spanners.
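Both observations, the claim of Example 3.8 and the fact that R _com does not distinguish powers of x from powers of its root, can be illustrated with a small Python sketch. It relies on the characterization x y = y x of R _com stated above; the function names are hypothetical.

```python
def commute(x, y):
    """(x, y) in R_com, using the characterization xy = yx."""
    return x + y == y + x

def in_l_imp(w):
    """w = u^n for some non-empty word u and some n >= 2."""
    return any(
        len(w) % d == 0 and w[:d] * (len(w) // d) == w
        for d in range(1, len(w) // 2 + 1)
    )

def accepted_by_rho(w):
    """Some split w = x . y with non-empty x, y and (x, y) in R_com."""
    return any(commute(w[:k], w[k:]) for k in range(1, len(w)))
```

On all words over {a, b} of length at most 8, `in_l_imp` and `accepted_by_rho` agree. In contrast, `commute("abab", "ab")` holds although ab is not a power of abab.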

Furthermore, the spanner that is constructed for R _com in the proof of Proposition 3.7 is more complicated than the corresponding word equation x y = y x. In fact, we constructed both spanners not from the equations, but from a characterization of the solutions. This appears to be necessary, due to the fact that spanners need to relate their variables to an input w, while word equations use their variables without such restrictions. We shall see in Theorem 3.13 that, if this is kept in mind, core spanners can be used to simulate word equations.

Before we consider this topic further, we examine how word equations can simulate spanners, as this shall provide useful insights on some questions of static analysis in Section 4.2. One drawback of word equations is that they are unable to express many comparatively simple regular languages, like A ^∗ for any non-empty A ⊂ Σ ^∗ (cf. Karhumäki et al. [26]). In order to overcome this problem, we consider the following extension:

Definition 3.9

Let η := (η _L, η _R) be a word equation. A regular constraints function ^{Footnote 1} is a function ${\mathcal {C}}$ that maps each x ∈ Vars(η) to a nondeterministic finite automaton ${\mathcal {C}}(x)$. A solution σ of η is a solution of η under constraints ${\mathcal {C}}$ if $\sigma (x)\in \mathcal {L}({\mathcal {C}}(x))$ holds for every x ∈ Vars(η).

Hence, regular constraints restrict the possible substitutions of a variable x to a regular language $\mathcal {L}(\mathcal {C}(x))$.
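To make the definition concrete, the following Python sketch checks whether a given substitution is a solution under constraints. As a simplifying assumption, each constraint is given as a Python regular expression instead of an NFA, patterns are lists of symbols, and every symbol that is not a key of the substitution is treated as a terminal. All names are hypothetical.

```python
import re

def apply_substitution(pattern, sigma):
    """Replace each variable by its image; terminals are kept unchanged."""
    return "".join(sigma.get(symbol, symbol) for symbol in pattern)

def is_solution(eta_l, eta_r, sigma):
    """sigma is a solution of (eta_l, eta_r) if both sides become equal."""
    return apply_substitution(eta_l, sigma) == apply_substitution(eta_r, sigma)

def is_solution_under_constraints(eta_l, eta_r, sigma, constraints):
    """constraints maps each variable to a regex standing in for the NFA C(x)."""
    if not is_solution(eta_l, eta_r, sigma):
        return False
    return all(
        re.fullmatch(constraints[x], sigma[x]) is not None
        for x in sigma
    )

# The equation xy = yx with the constraints x in (ab)^* and y in (ab)^+.
ETA_L, ETA_R = ["x", "y"], ["y", "x"]
CONSTRAINTS = {"x": "(ab)*", "y": "(ab)+"}
```

With these constraints, σ(x) = abab, σ(y) = ab is a solution under constraints, whereas σ(x) = σ(y) = ba solves the equation but violates both constraints.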

A syntactic extension of word equations is EC, the existential theory of concatenation, which is obtained by extending word equations with ∨, ∧, and existential quantification over variables. For example, R _cyc is expressed by the EC-formula

$$\varphi_{\text{cyc}}(x,y):= \exists z\colon (xz=zy). $$

Using appropriate coding techniques, one can transform every EC-formula into an equivalent word equation (see Diekert [6]). Although the transformations given in [6] can result in an exponential blowup, satisfiability of word equations and of EC-formulas can still be decided in PSPACE.

Like word equations, these formulas can be further extended by adding regular constraints. For each variable x and each nondeterministic finite automaton (NFA) A, the (regular) constraint L _A(x) is satisfied for a solution σ if $\sigma (x)\in \mathcal {L}(A)$. We call the resulting class of formulas EC ^reg, the existential theory of concatenation with regular constraints.

Example 3.10

Let A be an NFA with $\mathcal {L}(A)=\{\mathtt {a}\mathtt {b}^{i}\mathtt {a}\mid i\geq 1\}$, and define the EC ^reg-formula φ(x, y) := ∃z: (L _A(z) ∧ (∃z ₁, z ₂: x = z ₁ z z ₂) ∧ (∃z ₁, z ₂: y = z ₁ z z ₂)).

Then φ expresses the relation of all (x, y) that have a common subword z from $\mathcal {L}(A)$.
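A direct evaluation of this relation on concrete words might look as follows. This is only a sketch: the regular expression `ab+a` stands in for the NFA A, and the name `phi` is chosen for illustration.

```python
import re

# A lookahead is used so that overlapping occurrences, such as the two
# copies of "aba" in "ababa", are all reported.
FACTOR = re.compile(r"(?=(ab+a))")

def phi(x, y):
    """True if x and y share a subword z of the form a b^i a with i >= 1."""
    candidates = {m.group(1) for m in FACTOR.finditer(x)}
    return any(z in y for z in candidates)
```

For example, babba and abbab share the subword abba, while aba and abba have no common subword of the required form.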

Note that we intentionally use L _A(x) for constraint symbols instead of ${\mathcal {C}}$, to emphasize the following distinction in the use of constraints: In word equations, every variable x is constrained to one language $\mathcal {L}({\mathcal {C}}(x))$. In contrast to this, an EC ^reg-formula can use multiple constraint symbols for one variable (e. g., in the form of $L_{A}(x)\land L_{A^{\prime }}(x)$), or none at all.

Using the same techniques as for EC, one can transform EC ^reg-formulas into equivalent word equations with regular constraints. Again, the construction can result in an exponential blowup, but satisfiability of EC ^reg-formulas can still be decided in PSPACE (cf. Diekert [6]).

In order to simulate core spanners with EC ^reg-formulas, we introduce the following definition:

Definition 3.11

Let P be a core spanner with SVars(P) = {x ₁, …, x _n}, n ≥ 0, and let $\varphi (x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}})$ be an EC ^reg-formula. We say that φ realizes P if, for all $w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}}\in \Sigma ^{*}$, we have that $\varphi (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathtt {True}$ holds if and only if there is a μ ∈ P(w) with ${w^{P}_{k}} = w_{[1,i_{k}\rangle }$ and ${w^{C}_{k}} = w_{[i_{k},j_{k}\rangle }$ for each 1 ≤ k ≤ n, where [i _k, j _k〉 = μ(x _k).

This definition uses the fact that spans are always defined in relation to a word w. Note that every span [i, j〉 ∈ Spans(w) is characterized by the words w _[1,i〉 and w _[i,j〉. Hence, if μ ∈ P(w), the EC ^reg-formula models μ(x _k) = [i _k,j _k〉 by mapping x _w to w, ${x^{P}_{k}}$ to $w_{[1,i_{k}\rangle }$, and ${x^{C}_{k}}$ to $w_{[i_{k},j_{k}\rangle }$. In the naming of the variables, C stands for content, and P for prefix. This allows us to model spanners in EC ^reg-formulas:

Theorem 3.12

There is an algorithm that, given $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$, computes in polynomial time an EC ^reg -formula φ _ρ that realizes [[ρ]].

Proof

Before presenting the construction that is the main part of the proof, we briefly consider a technical detail of functional regex formulas. On an intuitive level, functional regex formulas guarantee that in each parse tree, every variable is assigned exactly once (hence, x{a} ⋅ x{a} is not functional). Consequently, it seems reasonable to conjecture that, if a functional regex formula contains a subformula of the form α ₁ ⋅ α ₂, then SVars(α ₁) ∩ SVars(α ₂) = ∅ must hold.

While this conjecture is true for regex formulas that do not contain ∅, it does not hold in general. For example, consider α := α ₁ ⋅ α ₂ with α ₁ := x{a} and α ₂ := (x{∅} ∨ b). Then x ∈ SVars(α ₁) ∩ SVars(α ₂), but as x{∅} can never be part of the label of a parse tree, the regex formula α is functional.

In order to exclude these fringe cases and simplify the construction of EC ^reg-formulas, we introduce the following concept: A regex formula α is ∅-reduced if α = ∅, or if α does not contain any occurrence of ∅. Using simple rewrite rules, we can observe the following.

Claim 1

There is an algorithm that, given a regex formula α, computes in polynomial time an ∅-reduced regex formula α _R with [[α _R]] = [[α]].

Proof

In order to compute α _R, it suffices to rewrite α according to the following rewrite rules:

1.
∅ ^∗ → ε,
2.
$(\hat {\alpha }\mathbin {\vee }\emptyset )\to \hat {\alpha }$ and $(\emptyset \mathbin {\vee }\hat {\alpha })\to \hat {\alpha }$ for all regex formulas $\hat {\alpha }$,
3.
$(\hat {\alpha }\cdot \emptyset )\to \emptyset $ and $(\emptyset \cdot \hat {\alpha })\to \emptyset $ for all regex formulas $\hat {\alpha }$,
4.
x{∅} → ∅ for all variables x.

As ∅ is never part of a parse tree, we can observe that for all regex formulas α and β, where β is obtained by applying any number of these rewrite rules, [[β]] = [[α]] holds. Furthermore, one can use these rules to convert α into an equivalent and ∅-reduced α _R in polynomial time: If α is stored in a tree structure, it suffices to apply all applicable rules in a bottom-up manner. $\square $ (Claim 1)
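The bottom-up application of the four rewrite rules can be sketched in Python as follows. The tuple encoding of regex formulas and the function names are assumptions made for this sketch; the rules themselves are the ones listed in the proof of Claim 1.

```python
EMPTY = ("empty",)   # the regex formula ∅
EPS = ("eps",)       # the empty word ε

def reduce_empty(alpha):
    """Apply rewrite rules 1-4 bottom-up on a tuple-encoded regex formula.

    Node shapes (an encoding assumed for this sketch):
    ("empty",), ("eps",), ("sym", a), ("star", e),
    ("or", e1, e2), ("cat", e1, e2), ("var", x, e).
    """
    kind = alpha[0]
    if kind in ("empty", "eps", "sym"):
        return alpha
    if kind == "star":
        inner = reduce_empty(alpha[1])
        return EPS if inner == EMPTY else ("star", inner)             # rule 1
    if kind == "or":
        left, right = reduce_empty(alpha[1]), reduce_empty(alpha[2])
        if left == EMPTY:
            return right                                              # rule 2
        if right == EMPTY:
            return left                                               # rule 2
        return ("or", left, right)
    if kind == "cat":
        left, right = reduce_empty(alpha[1]), reduce_empty(alpha[2])
        if EMPTY in (left, right):
            return EMPTY                                              # rule 3
        return ("cat", left, right)
    if kind == "var":
        inner = reduce_empty(alpha[2])
        return EMPTY if inner == EMPTY else ("var", alpha[1], inner)  # rule 4
    raise ValueError(f"unknown node kind: {kind!r}")

def contains_empty(alpha):
    """True if ∅ occurs anywhere in the formula."""
    return alpha == EMPTY or any(
        isinstance(child, tuple) and contains_empty(child) for child in alpha[1:]
    )
```

Applied to the formula x{a} ⋅ (x{∅} ∨ b) from above, the sketch returns the encoding of x{a} ⋅ b, in which x occurs only once.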

This allows us to proceed to the main part of the proof. Recall that our goal is a procedure that, given a $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ with SVars(ρ) = {x ₁, …, x _n}, constructs an EC ^reg-formula $\varphi _{\rho }(x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}})$ such that for all $w, {w^{P}_{1}}, {w^{C}_{1}},\ldots , {w^{P}_{n}}, {w^{C}_{n}}\in \Sigma ^{*}$, we have that $\varphi _{\rho }(w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathtt {True}$ holds if and only if there is some μ ∈ P(w) with ${w^{P}_{k}} = w_{[1,i_{k}\rangle }$ and ${w^{C}_{k}} = w_{[i_{k},j_{k}\rangle }$ for each 1 ≤ k ≤ n, where [i _k, j _k〉 = μ(x _k) and P := [[ρ]].

In fact, this μ is always uniquely defined by w, the ${w^{P}_{k}}$, and the ${w^{C}_{k}}$. Based on this, we introduce some notation that simplifies our reasoning. Given w ∈ Σ ^∗ and μ ∈ P(w), we define the (2n + 1)-tuple $\mathbf {w}_{\mu }:= (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$ by ${w^{P}_{k}} := w_{[1,i_{k}\rangle }$ and ${w^{C}_{k}} := w_{[i_{k},j_{k}\rangle }$ as in the previous paragraph. For the other direction, we say that a (2n + 1)-tuple $\mathbf {w}= (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$ over Σ ^∗ is spanner compatible if, for all 1 ≤ k ≤ n the concatenated word ${w^{P}_{k}}\cdot {w^{C}_{k}}$ is a prefix of w. In this case, we define μ _w through μ _w(x _k) = [i _k, j _k〉 with $i_{k}:= |{w^{P}_{k}}|+1$ and $j_{k}:=|{w^{P}_{k}} {w^{C}_{k}}|+1$ for 1 ≤ k ≤ n. Note that these are one-to-one conversions if w is fixed: Every μ defines its unique spanner compatible w _μ, and every spanner compatible w defines its unique μ _w. We can now rephrase Definition 3.11 using this terminology, and observe that φ _ρ realizes [[ρ]] if and only if the following two statements hold:

1.
For all w ∈ (Σ ^∗)²ⁿ⁺¹, we have that φ _ρ(w) = True implies that w is spanner compatible and μ _w ∈ P(w).
2.
If μ ∈ P(w), then φ _ρ(w _μ) = True.
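The two conversions between w-tuples and spanner compatible tuples can be written down directly. In the following Python sketch, a w-tuple is represented as a list of 1-based spans (i, j), one per variable x ₁, …, x _n; this representation and the function names are assumptions of the sketch.

```python
def encode(w, mu):
    """Build the tuple w_mu = (w, wP_1, wC_1, ..., wP_n, wC_n) from spans.

    mu is a list of spans (i, j) with 1 <= i <= j <= |w| + 1.
    """
    out = [w]
    for i, j in mu:
        out.append(w[:i - 1])       # prefix  w_[1,i>
        out.append(w[i - 1:j - 1])  # content w_[i,j>
    return tuple(out)

def is_spanner_compatible(tup):
    """Every concatenation wP_k . wC_k must be a prefix of w."""
    w, rest = tup[0], tup[1:]
    return all(
        w.startswith(rest[k] + rest[k + 1]) for k in range(0, len(rest), 2)
    )

def decode(tup):
    """Recover mu_w from a spanner compatible tuple."""
    if not is_spanner_compatible(tup):
        raise ValueError("tuple is not spanner compatible")
    rest = tup[1:]
    return [
        (len(rest[k]) + 1, len(rest[k]) + len(rest[k + 1]) + 1)
        for k in range(0, len(rest), 2)
    ]
```

For a fixed w, `decode(encode(w, mu))` returns `mu` again, which reflects that the conversions are one-to-one.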

We now proceed to the most complicated part of this proof, the construction of EC ^reg-formulas from regex formulas. (The following sub-proof is rather lengthy, as it contains the full induction for the correctness proof. The main part of the proof continues after the proof of Claim 2.)

Claim 2

There is an algorithm that, given a functional regex formula ρ ∈ RGX, constructs in polynomial time an EC ^reg-formula φ _ρ that realizes [[ρ]].

Proof

Due to Claim 1, we can assume without loss of generality that ρ is ∅-reduced. We define φ _ρ recursively as follows:

1.
If ρ does not contain any variables (i. e., n = 0), ρ is a proper regular expression. Using canonical transformation techniques, we can construct in polynomial time a non-deterministic finite automaton A with $\mathcal {L}(A)=\mathcal {L}(\rho )$, and we define
$$\varphi_{\rho}(x_{w}):= L_{A}(x_{w}).$$
Then φ _ρ realizes [[ρ]], as φ _ρ(w) = True holds if and only if $w\in \mathcal {L}(A)=\mathcal {L}(\rho )$, which holds if and only if μ _w ∈ [[ρ]](w).
2.
If ρ contains variables, we assume that SVars(ρ) = {x ₁, …, x _n} with n ≥ 1. By definition of regex formulas, no variable of ρ may occur inside of a Kleene star. Hence, we can distinguish three cases:
(a)
  ρ = ρ ₁ ∨ ρ ₂, where ρ ₁, ρ ₂ are functional regex formulas with SVars(ρ ₁) = SVars(ρ ₂) = SVars(ρ). We define
  $$\begin{array}{@{}rcl@{}} &&{}\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right):=\\ &&\qquad\left( \varphi_{\rho_{1}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\mathbin{\vee} \varphi_{\rho_{2}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\right). \end{array} $$
  The intuition behind this formula should be clear; we proceed directly to proving the correctness. Assume that $\varphi _{\rho _{1}}$ and $\varphi _{\rho _{2}}$ realize [[ρ ₁]] and [[ρ ₂]], respectively. We choose any w ∈ Σ ^∗. To show the direction from logic to spanners, we extend w into a tuple $\mathbf {w}$. By definition, $\varphi _{\rho }(\mathbf {w})=\mathtt {True}$ holds if and only if $\varphi _{\rho _{i}}(\mathbf {w})=\mathtt {True}$ for an i ∈ {1, 2}. As $\varphi _{\rho _{i}}$ realizes [[ρ _i]], the tuple $\mathbf {w}$ is spanner compatible, and μ _w ∈ [[ρ _i]](w) holds. For the other direction, we proceed analogously: If μ ∈ [[ρ _i]](w), then $\varphi _{\rho _{i}}(\mathbf {w}_{\mu })=\mathtt {True}$; hence, φ _ρ(w _μ) = True. We conclude that φ _ρ realizes [[ρ]].
(b)
  ρ = ρ ₁ ⋅ ρ ₂, where ρ ₁, ρ ₂ are functional regex formulas with SVars(ρ ₁) ∪ SVars(ρ ₂) = SVars(ρ) and SVars(ρ ₁) ∩ SVars(ρ ₂) = ∅. Without loss of generality, we can assume
  $$\begin{array}{@{}rcl@{}} {\textsf{SVars}\left( \rho_{1}\right)}&=&\{x_{1},\ldots,x_{m}\},\\ {\textsf{SVars}\left( \rho_{2}\right)}&=&\{x_{m+1},\ldots,x_{n}\} \end{array} $$
  with 0 ≤ m ≤ n. We define
  $$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\exists y_{1}, y_{2}, z^{P}_{m+1}, \ldots, {z^{P}_{n}}\colon \varphi_{I}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}},y_{1},y_{2},z^{P}_{m+1}, \ldots, {z^{P}_{n}}\right), \end{array} $$
  where
  $$\begin{array}{@{}rcl@{}} &&\varphi_{I}(x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}},y_{1},y_{2},z^{P}_{m+1}, \ldots, {z^{P}_{n}}):= \\ &&\qquad\qquad\qquad\left( {\vphantom{\underset{m+1\leq i \leq n}{\bigwedge}}}(x_{w} = y_{1}\cdot y_{2}) \wedge \varphi_{\rho_{1}}\left( y_{1},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{m}}, {x^{C}_{m}}\right)\right.\\ &&\left.\qquad\qquad\wedge \varphi_{\rho_{2}}\left( y_{2},z^{P}_{m+1}, x^{C}_{m+1}, \ldots, {z^{P}_{n}}, {x^{C}_{n}}\right) \!\wedge \underset{m+1\leq i \leq n}{\bigwedge} \left( {x^{P}_{i}} = y_{1} \cdot {z^{P}_{i}}\right)\right). \end{array} $$
  The idea behind this formula is as follows: As ρ = ρ ₁ ⋅ ρ ₂, whenever [[ρ]](w) ≠ ∅ holds, w can be decomposed into w = w ₁ ⋅ w ₂, where w ₁ is parsed in ρ ₁, and w ₂ in ρ ₂. We store these words in the variables y ₁ and y ₂, respectively. For all variables in SVars(ρ ₁), the spans of the μ ∈ [[ρ ₁]](w ₁) are also spans in w (as w ₁ is a prefix of w). Hence, we can use the results from ρ ₁ unchanged. On the other hand, [[ρ ₂]](w ₂) determines spans in relation to w ₂. Hence, each span [i, j〉 ∈ Spans(w ₂) corresponds to the span [i + c, j + c〉 ∈ Spans(w), where c := |w ₁|. The variables ${z^{P}_{i}}$ represent the start of the span with respect to y ₂, and the conjunction of the equations $({x^{P}_{i}} = y_{1} \cdot {z^{P}_{i}})$ converts these starts into starts with respect to x _w.
  
  The correctness proof is a little lengthy, but straightforward. Assume that $\varphi _{\rho _{1}}$ and $\varphi _{\rho _{2}}$ realize [[ρ ₁]] and [[ρ ₂]]. Assume that φ _ρ(w) = True for some tuple $\mathbf {w}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$. By definition of φ _ρ, the tuple w can be extended into $\mathbf {w}^{\prime }=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}},u_{1},u_{2},v^{P}_{m+1},\ldots ,{v^{P}_{n}})$ with φ _I(w′) = True. By observing the structure of φ _I, we obtain:
  i.
    w = u ₁ ⋅ u ₂,
  ii.
    ${w^{P}_{i}} = u_{1} \cdot {v^{P}_{i}}$ for m + 1 ≤ i ≤ n,
  iii.
    $\varphi _{\rho _{1}}(\mathbf {u}_{1})=\mathtt {True}$ and $\varphi _{\rho _{2}}(\mathbf {u}_{2})=\mathtt {True}$, where
    $$\begin{array}{@{}rcl@{}} \mathbf{u}_{1} &:=& \left( u_{1}, {w^{P}_{1}}, {w^{C}_{1}}, \ldots, {w^{P}_{m}}, {w^{C}_{m}}\right),\\ \mathbf{u}_{2} &:=& \left( u_{2}, v^{P}_{m+1}, w^{C}_{m+1}, \ldots, {v^{P}_{n}}, {w^{C}_{n}}\right). \end{array} $$
  From this and our initial assumption, we can conclude that w is spanner compatible, and that $\mu _{\mathbf {u}_{1}}\in [{\kern -2.3pt}[ \rho _{1} ]{\kern -2.3pt}](u_{1})$ and $\mu _{\mathbf {u}_{2}}\in [{\kern -2.3pt}[ \rho _{2} ]{\kern -2.3pt}](u_{2})$ must hold. Thus, there exist corresponding parse trees T ₁ and T ₂ with respective root labels (u ₁, ρ ₁) and (u ₂, ρ ₂). We combine these into a new parse tree T by adding a new root node (w, ρ ₁ ⋅ ρ ₂) that has T ₁ as left and T ₂ as right child. Writing $\mu _{1}:= \mu _{\mathbf {u}_{1}}$ and $\mu _{2}:= \mu _{\mathbf {u}_{2}}$, this tree T defines, as described in Definition 2.11, the w-tuple
  $$\mu^{T}(x_{k})\,=\,\left\{\begin{array}{ll}\left[\right.i_{k},j_{k}\rangle & \text{ if } 1\leq k\leq m \text{ and } \mu_{1}(x_{k})=\left[\right.i_{k},j_{k}\rangle,\\ \left[\right.i_{k}\,+\,|u_{1}|,j_{k}+|u_{1}|\rangle & \text{ if } m+1\leq k\leq n \text{ and } \mu_{2}(x_{k})=\left[\right.i_{k},j_{k}\rangle. \end{array}\right. $$
  In other words, for the variables x ₁ to x _m, the w-tuple μ ^T simulates μ ₁ in u ₁, the left part of w; and for the variables x _m+1 to x _n, it simulates μ ₂ in u ₂, the right part of w. Hence, all spans for the latter variables are shifted by |u ₁|. Using the equalities ${w^{P}_{i}} = u_{1} \cdot {v^{P}_{i}}$ from above, we obtain μ ^T = μ _w, which concludes this direction of the correctness proof. The other direction proceeds analogously: Given μ ∈ [[ρ]], we can use the corresponding parse tree T to factorize w into u ₁ and u ₂. We then shift the spans of the variables x _m+1 to x _n by |u ₁|, and use this to obtain u ₂ with $\varphi _{\rho _{2}}(\mathbf {u}_{2})=\mathtt {True}$. No effort is necessary for u ₁, and we can then combine u ₁ and u ₂ into a tuple w with φ _ρ(w) = True and w = w _μ. Thus, φ _ρ realizes [[ρ]].
(c)
  $\rho = x\{\hat {\rho }\}$ for some x ∈ {x ₁, …, x _n}, and $\hat {\rho }$ is a functional regex formula with $\textsf{SVars}(\hat {\rho }) = \textsf{SVars}(\rho )\setminus \{x\}$. Without loss of generality, let x = x ₁. We define
  $$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}(x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}) :=\\ &&\qquad\qquad\quad\left( \left( {x^{P}_{1}}=\varepsilon\right) \!\wedge\! \left( {x^{C}_{1}} = x_{w}\right) \!\wedge \varphi_{\hat{\rho}}\left( x_{w},{x^{P}_{2}}, {x^{C}_{2}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\right). \end{array} $$
  The formula uses the fact that in this case, for each μ ∈ [[ρ]](w), we have that μ(x ₁) = [1,|w| + 1〉 must hold. This is encoded by ${x^{P}_{1}}=\varepsilon $ and ${x^{C}_{1}} = w$. For the correctness proof, assume that $\varphi _{\hat {\rho }}$ realizes $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]$. Going from logic to spanners, assume that $\mathbf {w}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$ and φ _ρ(w) = True. Due to the structure of the formula, we know that ${w^{P}_{1}} =\varepsilon $, ${w^{C}_{1}}=w$, and $\varphi _{\hat {\rho }}(\hat {\mathbf {w}})=\mathtt {True}$ for $\hat {\mathbf {w}}=(w,{w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$. As $\varphi _{\hat {\rho }}$ realizes $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]$, we know that $\hat {\mathbf {w}}$ is spanner compatible, and $\mu _{\hat {\mathbf {w}}}\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)$. Due to this and the definition of ρ, we observe μ ∈ [[ρ]](w) for the w-tuple
  $$\mu(x_{k}):= \left\{\begin{array}{ll} \left[\right.1,|w|+1\rangle & \text{if } k=1,\\ \mu_{\hat{\mathbf{w}}}(x_{k}) & \text{if } k>1. \end{array}\right. $$
  As μ = μ _w, we conclude this direction of the proof. For the other direction, let μ ∈ [[ρ]](w). By definition, μ(x ₁) = [1,|w| + 1〉 and $\hat {\mu }\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]$ for $\hat {\mu }=\mu |_{\{x_{2},\ldots ,x_{n}\}}$. Due to our initial assumption, $\varphi _{\hat {\rho }}(\mathbf {w}_{\hat {\mu }})=\mathtt {True}$ must hold. Note that $\mathbf {w}_{\hat {\mu }}=(w,{w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$, and let $\mathbf {w}:= (w,\varepsilon , w, {w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$. Then φ _ρ(w) = True; and as w = w _μ, this concludes this direction. Thus, φ _ρ realizes [[ρ]].

Finally, note that the size of φ _ρ is polynomial in the size of ρ. More importantly, the construction of φ _ρ follows the syntax of ρ, and does not require expensive additional computations. Hence, φ _ρ can be computed in polynomial time. $\square $ (Claim 2)
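The shift of spans in the concatenation case of Claim 2 can be illustrated with a short Python sketch. It shows that moving a span of w ₂ by |w ₁| positions corresponds exactly to prepending w ₁ to its prefix word, which is what the equations ${x^{P}_{i}} = y_{1} \cdot {z^{P}_{i}}$ express. Spans are 1-based pairs, assignments are dictionaries, and all names are hypothetical.

```python
def prefix_and_content(w, span):
    """Encode a span [i, j> of w as the pair (w_[1,i>, w_[i,j>)."""
    i, j = span
    return w[:i - 1], w[i - 1:j - 1]

def shift(span, c):
    """Move a span of w2 to the corresponding span of w = w1 . w2."""
    i, j = span
    return i + c, j + c

def combine(w1, w2, mu1, mu2):
    """Merge assignments for rho_1 on w1 and rho_2 on w2 (disjoint variables)."""
    mu = dict(mu1)
    mu.update({x: shift(s, len(w1)) for x, s in mu2.items()})
    return mu
```

For w ₁ = ab and w ₂ = cde, the span [2, 4〉 of w ₂ becomes [4, 6〉 in w; its prefix word changes from c to abc, while its content de is unchanged.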

Using Claim 2, we have the conversion for RGX, the class of (functional) regex formulas. As the final step of the proof, we extend this to all core spanner representations (i. e., the full class $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$). Consider an arbitrary core spanner representation $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ with SVars(ρ) = {x ₁, …, x _n}, n ≥ 0. We distinguish the following cases:

1.
ρ is a regex formula. This case is covered in Claim 2.
2.
$\rho = \pi _{Y} \hat {\rho }$, with Y = SVars(ρ) and ${\textsf{SVars}\left (\hat {\rho }\right )}\supseteq {\textsf{SVars}\left (\rho \right )}$. Assume without loss of generality that $\textsf{SVars}(\hat {\rho })=\{x_{1},\ldots ,x_{n+m}\}$ with m ≥ 0. We define
$$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\qquad\qquad\exists x^{P}_{n+1},x^{C}_{n+1},\ldots,x^{P}_{n+m},x^{C}_{n+m}\colon \varphi_{\hat{\rho}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, x^{P}_{n+m}, x^{C}_{n+m}\right) \end{array} $$
Regarding the correctness, assume that $\varphi _{\hat {\rho }}$ realizes $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]$. Hence, if $\hat {\mu }\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)$, we have $\varphi _{\hat {\rho }}(\mathbf {w}_{\hat {\mu }})=\mathtt {True}$. This means that for $\mu := \hat {\mu }|_{Y}$, we know that φ _ρ(w _μ) = True holds as well. Likewise, if φ _ρ(w) = True, there exists an extension $\hat {\mathbf {w}}$ of w with $\mu _{\hat {\mathbf {w}}}\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)$. As $\hat {\mathbf {w}}$ is spanner compatible, so is w. Thus, we observe $\mu _{\mathbf {w}}=\mu _{\hat {\mathbf {w}}}|_{Y}$ and μ _w ∈ [[ρ]](w). Hence, φ _ρ realizes [[ρ]].
3.
$\rho = \zeta ^=_{\mathbf {x}} \hat {\rho }$, with $\mathbf {x}\in ({\textsf{SVars}\left (\hat {\rho }\right )})^{m}$, 2 ≤ m ≤ n, and $\textsf{SVars}(\rho )=\textsf{SVars}(\hat {\rho })$. Assume without loss of generality that x = (x ₁, …, x _m). We define
$$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\qquad\qquad\qquad\qquad\quad\left( \varphi_{\hat{\rho}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) \!\wedge\! \underset{2\leq i \leq m}{\bigwedge} \left( {x^{C}_{1}} \,=\, {x^{C}_{i}}\right)\right). \end{array} $$
Recall that $\zeta ^=_{x_{i},x_{j}}$ only checks whether $w_{\mu (x_{i})}=w_{\mu (x_{j})}$ holds, not whether μ(x _i) = μ(x _j). This is equivalent to checking whether ${x^{C}_{i}}={x^{C}_{j}}$ holds.

We only prove the correctness for m = 2; the other cases proceed analogously (or by reducing them to this binary case). Assume that $\varphi _{\hat {\rho }}$ realizes $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]$. Let μ ∈ [[ρ]](w). Then $w_{\mu (x_{1})}=w_{\mu (x_{2})}$ and $\mu \in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)$ hold by definition. The latter implies $\varphi _{\hat {\rho }}(\mathbf {w}_{\mu })=\mathtt {True}$. Together with the former and the structure of φ _ρ, we conclude $\varphi _{\rho }(\mathbf {w}_{\mu })=\mathtt {True}$.

For the other direction, let φ _ρ(w) = True. By the structure of φ _ρ, we know that $\varphi _{\hat {\rho }}(\mathbf {w})=\mathtt {True}$ and ${w^{C}_{1}}={w^{C}_{2}}$. As $\varphi _{\hat {\rho }}$ realizes $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]$, we have that w is spanner compatible, and $\mu _{\mathbf {w}}\in [{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}](w)$. Due to ${w^{C}_{1}}={w^{C}_{2}}$, this implies μ _w ∈ [[ρ]](w) and concludes the proof that φ _ρ realizes [[ρ]].
4.
ρ = (ρ ₁ ∪ ρ ₂), with SVars(ρ ₁) = SVars(ρ ₂) = SVars(ρ). Let
$$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\qquad\qquad\left( \varphi_{\rho_{1}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) \mathbin{\!\vee\!} \varphi_{\rho_{2}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\right). \end{array} $$
In this case, the construction and the correctness proof are identical to case 2a (disjunction) in the proof of Claim 2.
5.
ρ = (ρ ₁ ⋈ ρ ₂) with SVars(ρ) = SVars(ρ ₁) ∪ SVars(ρ ₂). We assume without loss of generality that SVars(ρ ₁) = {x ₁, …, x _l} and SVars(ρ ₂) = {x _m, …, x _n} with 0 ≤ l ≤ n, 1 ≤ m ≤ n + 1, and m ≤ l + 1. Note that this implies SVars(ρ ₁) ∩ SVars(ρ ₂) = {x _m, …, x _l}, and SVars(ρ ₁) ∩ SVars(ρ ₂) = ∅ if and only if m = l + 1. We define
$$\begin{array}{@{}rcl@{}} &&\varphi_{\rho}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) :=\\ &&\qquad\qquad\left( \varphi_{\rho_{1}}\left( x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{l}}, {x^{C}_{l}}\right) \!\wedge\! \varphi_{\rho_{2}}\left( x_{w},{x^{P}_{m}}, {x^{C}_{m}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right)\right). \end{array} $$
The definition of ⋈ requires that μ ∈ [[ρ]](w) holds if and only if there are μ ₁ ∈ [[ρ ₁]](w) and μ ₂ ∈ [[ρ ₂]](w) with μ ₁(x _i) = μ ₂(x _i) for all i ∈ {m, …, l}. For each of these variables x _i, we have that $\varphi _{\rho _{1}}$ and $\varphi _{\rho _{2}}$ model the span with the same variables ${x^{P}_{i}}$ and ${x^{C}_{i}}$.

To prove the correctness, assume that $\varphi _{\rho _{1}}$ and $\varphi _{\rho _{2}}$ realize [[ρ ₁]] and [[ρ ₂]], respectively. Let μ ∈ [[ρ]](w). Then there exist μ ₁ ∈ [[ρ ₁]](w) and μ ₂ ∈ [[ρ ₂]](w) with $\mu _{1}=\mu |_{\{x_{1},\ldots ,x_{l}\}}$ and $\mu _{2}=\mu |_{\{x_{m},\ldots ,x_{n}\}}$, which implies μ ₁(x _k) = μ ₂(x _k) for m ≤ k ≤ l. Now, in order to talk about the components of $ \mathbf {w}_{\mu _{1}}$ and $ \mathbf {w}_{\mu _{2}}$, we name the components of the tuples as $\mathbf {w}_{\mu _{1}}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{l}},{w^{C}_{l}})$ and $\mathbf {w}_{\mu _{2}}=(w,{w^{P}_{m}},{w^{C}_{m}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$. As μ ₁ and μ ₂ agree on their common variables, we can combine this to $\mathbf {w}:= (w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathbf {w}_{\mu }$. As each $\varphi _{\rho _{i}}$ realizes [[ρ _i]], we know that $\varphi _{\rho _{i}}(\mathbf {w}_{\mu _{i}})=\mathtt {True}$. Hence, φ _ρ(w _μ) = φ _ρ(w) = True. This concludes this direction.

For the other direction, assume that φ _ρ(w) = True. Due to the structure of the formula, this implies $\varphi _{\rho _{i}}(\mathbf {w}_{i})=\mathtt {True}$, where $\mathbf {w}_{1}:= (w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{l}},{w^{C}_{l}})$ and $\mathbf {w}_{2}:= (w,{w^{P}_{m}},{w^{C}_{m}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})$. As $\varphi _{\rho _{i}}$ realizes [[ρ _i]], we know that w _i is spanner compatible, and $\mu _{\mathbf {w}_{i}}\in [{\kern -2.3pt}[ \rho _{i} ]{\kern -2.3pt}](w)$ . Due to the former, w is also spanner compatible. Due to the latter, we know that μ _w ∈ [[ρ]](w), as $\mu _{\mathbf {w}}(x_{k})=\mu _{\mathbf {w}_{1}}(x_{k})=\mu _{\mathbf {w}_{2}}(x_{k})$ for all m ≤ k ≤ l. Hence, φ _ρ realizes [[ρ]].

The formula φ _ρ can be derived from ρ without requiring further computation, and its size is polynomial in the size of ρ. Hence, φ _ρ can be constructed in polynomial time.

As we shall see in Section 4.2, this result allows us to find upper bounds on two problems from the static analysis of spanners. We now examine how spanners can simulate word equations (and, thereby, also EC ^reg-formulas). As discussed above, spanners need to relate their variables to an input word. Hence, we only state the following result, which is a weaker form of simulation than for the other direction:

Theorem 3.13

Every word equation η := (η _L, η _R) with regular constraints ${\mathcal {C}}$ can be converted effectively into a $\rho \in \textsf{RGX}^{\{\zeta ^=,\times \}}$ with SVars(ρ) ⊇ V a r s(η) such that for all w ∈ Σ ^∗, there is a solution σ of η under constraints $\mathcal {C}$ with w = σ(η _L) = σ(η _R) if and only if there is a μ ∈ [[ρ]](w) with σ(x) = w _μ(x) for all x ∈ V a r s(η).

Proof

As each of the two sides of a word equation is a pattern, we can transform those into regex formulas by using a slightly adapted version of the conversion procedure from the proof of Theorem 3.3. Only two changes are made. Firstly, instead of binding a variable x to some Σ ^∗, we respect the constraints by using a regular expression for the language $\mathcal {L}({\mathcal {C}}(x))$. Secondly, in order to ensure SVars(ρ) ⊇ V a r s(η), the first occurrence of a variable x is not represented by x ₁, but by x.

Assume that η _L = α ₁⋯α _m and η _R = α _{m+1}⋯α _n with $m,n\in \mathbb {N}$, m + 1 ≤ n, and α ₁, …, α _n ∈ (Σ ∪ X). We construct regex formulas $\hat {\eta }_{L}:= \hat {\alpha }_{1}{\cdots } \hat {\alpha }_{m}$ and $\hat {\eta }_{R}:= \hat {\alpha }_{m+1}{\cdots } \hat {\alpha }_{n}$, where for each position i with 1 ≤ i ≤ n, we define $\hat {\alpha }_{i}$ as follows:

1.
If α _i is a terminal (i. e., there is an a ∈ Σ with α _i = a), let $\hat {\alpha }_{i}:= a$.
2.
If α _i is a variable (i. e., there is an x ∈ X with α _i = x), let γ be a regular expression with $\mathcal {L}(\gamma )=\mathcal {L}(\mathcal {C}(x))$. Furthermore, let j := |α ₁⋯α _i|_x.
(a)
  If j = 1, define $\hat {\alpha }_{i}:= x\{\gamma \}$.
(b)
  If j ≥ 2, define $\hat {\alpha }_{i}:= x_{j}\{\gamma \}$ (where x _j ∈ SVars is a new variable).

This ensures that ${\textsf{SVars}\left (\hat {\eta }_{L}\right )}$ and ${\textsf{SVars}\left (\hat {\eta }_{R}\right )}$ are disjoint. We then construct a sequence S of string equality selections appropriately: For every x ∈ V a r s(η) with k := |η _L η _R|_x ≥ 2, the sequence S includes a selection $\zeta ^=_{x,x_{2},\ldots ,x_{k}}$.

Finally, we define $\rho := S(\hat {\eta }_{L}\times \hat {\eta }_{R})$.
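The construction of $\hat {\eta }_{L}$, $\hat {\eta }_{R}$, and S can be sketched as a short program. This is only an illustration under assumptions that are not part of the paper: the function name `equation_to_spanner`, the encoding of each side as a list of symbols, and the rendering of regex formulas as plain strings are all hypothetical, and the regular constraints are assumed to be given as regular expressions already.

```python
from collections import Counter

def equation_to_spanner(eta_l, eta_r, constraints):
    """Build (eta_hat_L, eta_hat_R, selections) for a word equation.

    eta_l, eta_r: lists of symbols; a symbol is a variable iff it is a key
    of `constraints`, otherwise it is treated as a terminal.
    constraints: dict mapping each variable to a regular expression (string).
    """
    seen = Counter()  # occurrences of each variable so far, read left to right

    def convert(side):
        parts = []
        for sym in side:
            if sym not in constraints:       # terminal: copied unchanged
                parts.append(sym)
                continue
            seen[sym] += 1
            j = seen[sym]
            # the first occurrence keeps the name x, the j-th one becomes x_j
            name = sym if j == 1 else f"{sym}_{j}"
            parts.append(f"{name}{{{constraints[sym]}}}")
        return "·".join(parts)

    hat_l = convert(eta_l)
    hat_r = convert(eta_r)   # the occurrence count continues across both sides
    # one string equality selection per variable with at least two occurrences
    selections = [
        [x] + [f"{x}_{j}" for j in range(2, k + 1)]
        for x, k in seen.items() if k >= 2
    ]
    return hat_l, hat_r, selections
```

Each inner list of `selections` stands for one selection $\zeta ^=_{x,x_{2},\ldots ,x_{k}}$; applying all of them to the product of the two returned formulas corresponds to ρ.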

In order to prove that this construction is correct, we show that for all w ∈ Σ ^∗, μ ∈ [[ρ]](w) holds if and only if there is a solution σ of η under constraints ${\mathcal {C}}$ with

1.
w = σ (η _L) = σ(η _R), and
2.
σ (x) = w _μ(x) for all x ∈ V a r s(η).

We begin with the if-direction. Assume that σ is a solution of η under constraints ${\mathcal {C}}$. Let w := σ(η _L) (which implies w = σ(η _R), as σ is a solution of η). We use this to define a w-tuple μ as follows: Due to our construction, each variable $\hat {x}\in {\textsf{SVars}\left (\rho \right )}$ corresponds to a uniquely defined α _i with α _i = x. If 1 ≤ i ≤ m, then $\hat {x}$ occurs in $\hat {\eta _{L}}$, and if m + 1 ≤ i ≤ n, then $\hat {x}$ occurs in $\hat {\eta _{R}}$. We now define $\mu (\hat {x}):= [l,r\rangle $, where the choice of l and r depends on this distinction:

If $\hat {x}$ occurs in $\hat {\eta _{L}}$, let l := | σ(α ₁⋯α _{i−1})| + 1 and r := | σ(α ₁⋯α _i)| + 1,
If $\hat {x}$ occurs in $\hat {\eta _{R}}$, let l := | σ(α _{m+1}⋯α _{i−1})| + 1 and r := | σ(α _{m+1}⋯α _i)| + 1.

Either way, we know that $w_{\mu (\hat {x})}=\sigma (x)$ holds, which implies $w_{\mu (\hat {x})}\in \mathcal {L}({\mathcal {C}}(x))$. Analogously, we can use σ to construct parse trees for $(w,\hat {\eta }_{L})$ and $(w,\hat {\eta }_{R})$. This allows us to conclude $\mu \in [{\kern -2.3pt}[ \hat {\eta }_{L}\times \hat {\eta }_{R} ]{\kern -2.3pt}](w)$. Furthermore, for every selection $\zeta ^=_{x,x_{2},\ldots ,x_{k}}$ in S, we know from the construction that x and all x _i (2 ≤ i ≤ k) refer to the same x ∈ V a r s(η), which means that $w_{\mu (x)}=w_{\mu (x_{i})}=\sigma (x)$ holds. Hence, for each of these selections, $\mu \in [{\kern -2.3pt}[ \hat {\eta }_{L}\times \hat {\eta }_{R} ]{\kern -2.3pt}](w)$ implies $\mu \in [{\kern -2.3pt}[ \zeta ^=_{x,x_{2},\ldots ,x_{k}}(\hat {\eta }_{L}\times \hat {\eta }_{R}) ]{\kern -2.3pt}](w)$. Thus, $\mu \in [{\kern -2.3pt}[ S(\hat {\eta }_{L}\times \hat {\eta }_{R}) ]{\kern -2.3pt}](w)$, which is equivalent to μ ∈ [[ρ]](w) and concludes this direction of the proof.

For the only if-direction, assume that μ ∈ [[ρ]](w). We now define a pattern substitution σ by σ(a) := a for all a ∈ Σ, and σ(x) := w _μ(x) for all x ∈ V a r s(η). By our construction, μ(x) is derived from x{γ}, where $\mathcal {L}(\gamma )=\mathcal {L}({\mathcal {C}}(x))$ must hold, which means that $w_{\mu (x)}\in \mathcal {L}({\mathcal {C}}(x))$, and hence $\sigma (x)\in \mathcal {L}({\mathcal {C}}(x))$. All that remains to be shown is that σ(η _L) = σ(η _R) = w. In order to prove this, we first define $\hat {w}_{L} = \hat {w}_{1}{\cdots } \hat {w}_{m}$ and $\hat {w}_{R} = \hat {w}_{m+1}{\cdots } \hat {w}_{n}$, where the $\hat {w}_{i}$ with 1 ≤ i ≤ n are defined as follows:

1.
If α _i = a ∈ Σ, let $\hat {w}_{i}:= a$. Then $\hat {w}_{i}=\hat {\alpha }_{i}$ and $\hat {w}_{i}=\sigma (\alpha _{i})$ hold by definition.
2.
If α _i = x ∈ X, let j := |α ₁⋯α _i|_x. We distinguish two cases.
(a)
  If j = 1, let $\hat {w}_{i} = w_{\mu (x)}$. Then $\sigma (\alpha _{i})=\hat {w}_{i}$ holds by definition.
(b)
  If j ≥ 2, let $\hat {w}_{i} = w_{\mu (x_{j})}$. Observe that S contains the selection $\zeta ^=_{x,x_{2},\ldots ,x_{k}}$. Hence, $w_{\mu (x_{j})}=w_{\mu (x)}$ holds, which implies $\sigma (\alpha _{i})=\hat {w}_{i}$.

Now note that the $\hat {w}_{i}$ correspond to the labels of the parse trees that have root labels $(w,\hat {\eta }_{L})$ and $(w,\hat {\eta }_{R})$. Hence, $\hat {w}_{L}=w$ and $\hat {w}_{R}=w$ must hold. Furthermore, we have $\hat {w}_{i}=\sigma (\alpha _{i})$ for all 1 ≤ i ≤ n. This allows us to conclude

$$\begin{array}{rcl} \sigma(\eta_{L}) &=& \sigma(\alpha_{1}\cdots \alpha_{m}) = \hat{w}_{1}\cdots \hat{w}_{m} = \hat{w}_{L},\\ \sigma(\eta_{R}) &=& \sigma(\alpha_{m+1}\cdots \alpha_{n}) = \hat{w}_{m+1}\cdots \hat{w}_{n} = \hat{w}_{R}. \end{array} $$

We observe σ(η _L) = σ(η _R) = w, which concludes this direction of the proof. □

While this form of simulation is weaker (as w has to be present), it still shows that the constructed spanner is satisfiable if and only if the word equation (with constraints) is satisfiable. Furthermore, the computed (V, w)-relations encode solutions of the equation.

Example 3.14

Let a, b ∈ Σ and define η := (x y, y x) with $\mathcal {L}({\mathcal {C}}(x))=\mathcal {L}(\mathtt {abb^{+}})$ and $\mathcal {L}({\mathcal {C}}(y))=\Sigma ^{+}$. The construction from the proof of Theorem 3.13 results in

$$\rho:= \zeta^=_{x,x_{2}}\zeta^=_{y,y_{2}} (\hat{\eta}_{L}\times \hat{\eta}_{R}),$$

where $\hat {\eta }_{L}:= x\{\mathtt {abb^{+}}\}\cdot y\{\Sigma ^{+}\}$ and $\hat {\eta }_{R}:= y_{2}\{\Sigma ^{+}\}\cdot x_{2}\{\mathtt {abb^{+}}\}.$
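The behavior of a spanner of this shape can be explored by brute force on short words. The following sketch is not an evaluation algorithm from the paper; it enumerates all words up to a length bound and tests the defining condition directly. The function name `equation_language` and the use of Python's `re` syntax for the constraints are assumptions made for illustration, and the usage below takes the constraint abb⁺ for x (the one that also appears in Lemma 3.24).

```python
import re
from itertools import product

def equation_language(cx, cy, alphabet, max_len):
    """Words w = xy = yx of length <= max_len with x matching cx and y matching cy."""
    rx, ry = re.compile(cx), re.compile(cy)
    words = set()
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            for i in range(n + 1):
                x, y = w[:i], w[i:]
                # eta_hat_L parses w as x·y and eta_hat_R parses w as y_2·x_2;
                # the two selections force x_2 = x and y_2 = y as strings.
                if y + x == w and rx.fullmatch(x) and ry.fullmatch(y):
                    words.add(w)
                    break
    return words

# Words (ab^m)^n with m, n >= 2 and length at most 10, for comparison.
expected = {("a" + "b" * m) * n
            for m in range(2, 10) for n in range(2, 10)
            if (m + 1) * n <= 10}
found = equation_language("abb+", "[ab]+", "ab", 10)
```

On this bounded search, `found` and `expected` coincide, which is consistent with the fact that commuting words are powers of a common word.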

The only reason that this construction is not necessarily possible in polynomial time is that regular constraints are specified with NFAs, while core spanners use regular expressions, which can lead to an exponential increase in the size.

There is a similar construction that does not use the join operator: By adding new variables z ₁, z ₂, we can construct

$$\hat{\rho}:= \zeta^=_{x,x_{2}}\zeta^=_{y,y_{2}}\zeta^=_{z_{1},z_{2}}(z_{1}\{\hat{\eta}_{L}\}z_{2}\{\hat{\eta}_{R}\}), $$

which behaves almost like ρ; the only difference is that the solution is encoded in w = σ(η _L ⋅ η _R), instead of σ (η _L).

3.3 Xregexes

As shown by Fagin et al. [7], there are languages that are recognized by xregexes, but not by core spanners. In order to prove this, [7] introduced the so-called “uniform-0-chunk”-language L _uzc: Assuming 0, 1 ∈ Σ, L _uzc is defined as the language of all w = s ₁ ⋅ t ⋅ s ₂ ⋅ t⋯s _n−1 ⋅ t ⋅ s _n, where n > 0, s ₁, …, s _n ∈ {1}⁺, and t ∈ {0}⁺. Then $\mathcal {L}(\alpha _{\text {uzc}})=L_{\text {uzc}}$ holds for the xregex α _uzc := 1⁺ ⋅ x{0^∗} ⋅ (1⁺ ⋅& x)^∗ ⋅ 1⁺, but no core spanner recognizes L _uzc.

Considering that the syntax of regex formulas does not allow the use of variables inside a Kleene star (or plus), this inexpressibility result might be considered expected, as α _uzc has an occurrence of &x inside a Kleene star. This raises the question whether xregexes that restrict variables in a similar manner can still recognize languages that core spanners cannot. In order to examine this question, we define the following subclass of xregexes:

Definition 3.15

An xregex α is variable star-free (short: vstar-free) if, for every β ∈ S u b(α) with β = γ ^∗, no subexpression of γ is a variable binding or a variable reference. We denote the class of all vstar-free xregexes by v s f X R.

As we shall see in Theorem 3.21 below, every language that is recognized by a vstar-free xregex is also recognized by a core spanner. While this observation might be considered not very surprising, its proof needs to deal with some technicalities. In particular, one needs to deal with expressions like α := x{Σ ^∗}⋅ (&x ∨& x&x). A conversion in the spirit of Theorem 3.3 would need to replace the &x with distinct variables and ensure equality with selections; but as the disjunction contains subexpressions with distinct numbers of occurrences of &x, we would not be able to ensure functionality of the resulting regex formula. We avoid these problems by working with the following syntactically restricted class of vstar-free xregexes:

Definition 3.16

An α ∈ v s f X R is an xregex path if, for every β ∈ S u b(α) with β = (γ ₁ ∨ γ ₂), no subexpression of γ ₁ or γ ₂ is a variable binding or a variable reference. We denote the class of all xregex paths by X R P.

Intuitively, an xregex path α ∈ X R P can be understood as a concatenation α = α ₁⋯α _n, where each α _i is either a proper regular expression, a variable reference, or a variable binding of the form $\alpha _{i} = x\{\hat {\alpha }\}$, where $\hat {\alpha }$ is also an xregex path. By “multiplying out” disjunctions that contain variables, we can convert every vstar-free xregex into a disjunction of xregex paths.

Lemma 3.17

There is an algorithm that, given α ∈ v s f X R, computes α ₁, …, α _n ∈ X R P with $\mathcal {L}(\alpha )=\bigcup _{i=1}^{n}\mathcal {L}(\alpha _{i})$.

Proof

If a vstar-free xregex α is not an xregex path, there exists at least one x ∈ V a r s(α) and at least one subexpression β ∈ S u b(α) with β ≠ α such that

1.
β is a disjunction; i. e., β = (γ ₁ ∨ γ ₂) for some γ ₁, γ ₂ ∈ v s f X R,
2.
β contains a variable binding x{⋯} or a variable reference &x.

We now rewrite α into two vstar-free xregexes α ₁ and α ₂, by replacing β with γ ₁ or γ ₂, respectively. We observe that this rewriting step does not change the language:

Claim 1

$\mathcal {L}(\alpha )=\mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})$

Proof

If $w\in \mathcal {L}(\alpha )$, there exists an α-parse tree T for w; in other words, the root of T is labelled with (w, α). Recall that α is vstar-free. Hence, we know that T uses the occurrence of β that was rewritten to create α ₁ and α ₂ at most once (in order to be able to use the occurrence multiple times, α would need to contain a star around β).

This allows us to distinguish two possibilities: If T does not use this occurrence of β at all, we can immediately transform T into an α _i-parse tree T _i (i ∈ {1, 2}) by replacing the root label with (w, α _i), and changing all children accordingly. Hence, $w\in \mathcal {L}(\alpha _{i})$ holds.

On the other hand, if T uses this occurrence of β, then there exists a uniquely defined node v in T that is labeled with $(\hat {w},\beta )$ for some word $\hat {w}\in \Sigma ^{*}$. Furthermore, this node corresponds to the occurrence of β that was rewritten in α ₁ and α ₂. By definition, v has exactly one child $\hat {v}$, which is labeled with $(\hat {w},\gamma _{i})$ for some i ∈ {1, 2}. We rewrite T into an α _i-parse tree T _i by removing v (i. e., $\hat {v}$ replaces v), relabeling the root of T to (w, α _i), and changing all labels between the root and $\hat {v}$ accordingly. As T _i is an α _i-parse tree for w, we have that $w\in \mathcal {L}(\alpha _{i})$ holds. This proves $\mathcal {L}(\alpha )\subseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})$.

In order to prove $\mathcal {L}(\alpha )\supseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})$, we proceed analogously: If $w\in \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})$, we can transform an α _i-parse tree for w into an α-parse tree by inserting a node $(\hat {w},\beta )$ (if necessary), and changing the labels accordingly. $\square $ (Claim 1)

Note that this equivalence relies on the fact that α is vstar-free, which implies that β does not occur inside a Kleene star. For xregexes that are not vstar-free, we can only conclude $\mathcal {L}(\alpha )\supseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})$. This is easily seen considering the example of x{a}y{b}(&x ∨ &y)^∗, which would be rewritten to x{a}y{b}(&x)^∗ and x{a}y{b}(&y)^∗: the word abab belongs to the language of the original xregex, but to neither of the two rewritten ones.
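The strictness of the inclusion for xregexes that are not vstar-free can be observed with backreference regexes. The sketch below uses Python's `re` module as a stand-in for xregexes, which is an assumption: capture groups play the role of variable bindings and `\1`, `\2` the role of &x and &y, but the formal semantics of the two models differ in general. The two rewritten expressions are taken to be the ones obtained by replacing the disjunction with each of its branches.

```python
import re

alpha   = re.compile(r"(a)(b)(?:\1|\2)*")   # x{a} y{b} (&x ∨ &y)*
alpha_1 = re.compile(r"(a)(b)(?:\1)*")      # disjunction replaced by &x
alpha_2 = re.compile(r"(a)(b)(?:\2)*")      # disjunction replaced by &y

w = "abab"
in_alpha = alpha.fullmatch(w) is not None
in_union = (alpha_1.fullmatch(w) is not None
            or alpha_2.fullmatch(w) is not None)
```

Here `in_alpha` holds while `in_union` does not, since the star allows the original expression to alternate between the two references.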

We repeat this rewriting procedure on every created vstar-free xregex that is not an xregex path. This procedure terminates, as every rewriting removes a disjunction that contains at least one variable (binding or reference). Hence if α contains $k\in \mathbb {N}_{>0}$ disjunctions, this process results in xregex paths α ₁, …, α _n for some n ≤ 2^k, and $\mathcal {L}(\alpha )=\bigcup _{i=1}^{n}\mathcal {L}(\alpha _{i})$.

Example 3.18

Let α := x{Σ ^∗}⋅& x ⋅ (x{Σ ^∗} ∨ y{Σ ^∗}) ⋅ (&x ∨ &y) ⋅ &x. Multiplying out the disjunctions, we obtain the following xregex paths:

$$\begin{array}{@{}rcl@{}} \alpha_{1} &=& x\{\Sigma^{*}\}\cdot\&x\cdot x\{\Sigma^{*}\}\cdot \&x\cdot\&x,\\ \alpha_{2} &=& x\{\Sigma^{*}\}\cdot\&x\cdot x\{\Sigma^{*}\} \cdot \&y\cdot\&x,\\ \alpha_{3} &=& x\{\Sigma^{*}\}\cdot\&x\cdot y\{\Sigma^{*}\}\cdot \&x\cdot\&x,\\ \alpha_{4} &=& x\{\Sigma^{*}\}\cdot\&x\cdot y\{\Sigma^{*}\}\cdot \&y\cdot\&x. \end{array} $$

Then $\mathcal {L}(\alpha )=\bigcup _{i=1}^{4}\mathcal {L}(\alpha _{i})$.
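The "multiplying out" step can be sketched on a small syntax tree. The encoding below is hypothetical and not taken from the paper: nodes are tuples `("re", s)` for a variable-free regular expression, `("ref", x)`, `("bind", x, child)`, `("or", left, right)`, and `("cat", children)`. The sketch also works recursively rather than by the repeated rewriting used in the proof of Lemma 3.17, and it assumes a vstar-free input, so that stars only occur inside `"re"` nodes.

```python
from itertools import product

def has_var(node):
    """True if the node contains a variable binding or a variable reference."""
    kind = node[0]
    if kind in ("bind", "ref"):
        return True
    if kind == "cat":
        return any(has_var(c) for c in node[1])
    if kind == "or":
        return has_var(node[1]) or has_var(node[2])
    return False  # "re": a variable-free regular expression

def paths(node):
    """Return xregex paths whose languages union to the language of node."""
    kind = node[0]
    if kind == "or" and has_var(node):
        return paths(node[1]) + paths(node[2])      # split the disjunction
    if kind == "cat":
        return [("cat", list(combo))
                for combo in product(*(paths(c) for c in node[1]))]
    if kind == "bind":
        return [("bind", node[1], p) for p in paths(node[2])]
    return [node]  # "ref", "re", or a disjunction without variables

def show(node):
    kind = node[0]
    if kind == "re":
        return node[1]
    if kind == "ref":
        return "&" + node[1]
    if kind == "bind":
        return node[1] + "{" + show(node[2]) + "}"
    if kind == "or":
        return "(" + show(node[1]) + "∨" + show(node[2]) + ")"
    return "·".join(show(c) for c in node[1])

any_word = ("re", "Σ*")
alpha = ("cat", [("bind", "x", any_word), ("ref", "x"),
                 ("or", ("bind", "x", any_word), ("bind", "y", any_word)),
                 ("or", ("ref", "x"), ("ref", "y")),
                 ("ref", "x")])
alpha_paths = [show(p) for p in paths(alpha)]
```

For the xregex of Example 3.18, `alpha_paths` contains four strings, one for each combination of branches.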

This transformation process might result in an exponential number of xregex paths; but as efficiency is not of concern right now, this is not a problem (the followup paper Freydenberger [13] shows that this blowup can be avoided with a more involved construction). Each of these xregex paths is then transformed into a functional regex formula:

Lemma 3.19

There is an algorithm that, given α ∈ X R P, computes $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=\}}$ with $\mathcal {L}(\rho )=\mathcal {L}(\alpha )$.

Proof

Before we start with the proof, note that we can safely assume that α does not contain ∅: If ∅ occurs inside a Kleene star (or a disjunction), that Kleene star (or disjunction) cannot contain any variable bindings or references, as α is an xregex path. Hence, we can remove ∅ as in the proof of Theorem 3.12. All other occurrences of ∅ imply $\mathcal {L}(\alpha )=\emptyset $ – in this case, we are done.

Our goal is to rewrite the xregex path α into an equivalent core spanner of the form π _∅ S δ, where δ is a regex formula, and S is a sequence of string equality selections.

The main idea of the construction is quite straightforward: We basically replace each variable reference &x with a unique x _i{Σ ^∗}, and use a string equality $\zeta ^=_{x,x_{i}}$ to connect x _i with the appropriate binding. The only technical problem is that unlike regex formulas, xregexes allow variables to be bound multiple times. We solve this by using a unique variable for every occurrence of a variable binding in α.

As explained above, the xregex path α can be understood as a concatenation α = α ₁⋯α _n, where each α _i is either a proper regular expression, a variable reference, or a variable binding of the form $\alpha _{i} = x\{\hat {\alpha }\}$, where $\hat {\alpha }$ is also an xregex path.

Now, if we choose any occurrence of a variable reference &x in α, exactly one of the following two cases applies:

1.
There is no binding x{} in α that is to the left of that occurrence of &x, or
2.
there is a binding x{} in α that is to the left of that occurrence of &x.

In the first case, this &x will always default to ε, which means that we can safely replace it with ε.

In the second case, we see that this &x will always refer to the variable binding x{} that is closest to it to the left in α. In other words, we can simply read α from left to right. All &x before the first binding for x default to ε; and all &x after the first binding for x refer to the most recent binding for x (recall that, according to our definition of xregexes, no variable binding for a variable x may contain another binding of x).

This allows us to rewrite α into an xregex path γ with $\mathcal {L}(\gamma )=\mathcal {L}(\alpha )$ such that no occurrence of a variable reference &x in γ refers to the default value ε, and every variable binding x{⋯} occurs at most once. This is done in the following way: We read α from left to right. If we encounter a reference &x for which no binding has been seen, we replace it with ε. If we encounter a binding x{} that has already been seen before, we replace it with a binding for a new variable $\hat {x}$, and all occurrences of &x that refer to this binding (i. e., those to its right, up to the next binding of x) are renamed to $\&\hat {x}$. (Of course, further occurrences of x{} would require further new variables.) For example, the xregex path

$$\alpha_{2} = x\{\Sigma^{*}\}\cdot\&x\cdot x\{\Sigma^{*}\} \cdot \&y\cdot\&x $$

from Example 3.18 would be rewritten to

$$\gamma_{2} = x\{\Sigma^{*}\}\cdot\&x\cdot \hat{x}\{\Sigma^{*}\} \cdot \varepsilon\cdot\&\hat{x}. $$

After rewriting α to γ, the next step is to transform γ into a regex formula δ by replacing all variable references in a manner that is similar to the proof of Theorem 3.3. More specifically, we construct δ by replacing, for each x ∈ V a r s(γ), the i-th occurrence of &x in γ with x _i{Σ ^∗}. Note that δ is functional: Each variable in SVars(δ) appears exactly once in δ; and as δ is also an xregex path, this implies that every δ-parse tree contains every variable exactly once. (Recall that we assumed that α does not contain ∅; hence, neither do γ and δ.)

For every variable x for which there occur references &x in γ, we define a selection $\zeta ^=_{V_{x}}$, where V _x := {x} ∪ {x _i ∣ x _i occurs in δ}. We let S denote a sequence of these selections (the order is irrelevant), and define the spanner representation ρ := π _∅ S δ. As we simulate the behavior of each variable binding x{⋯} and its references &x using the selection $\zeta ^=_{V_{x}}$, it is easy to see that $\mathcal {L}(\rho )=\mathcal {L}(\gamma )$ and, hence, $\mathcal {L}(\rho )=\mathcal {L}(\alpha )$. □

Example 3.20

Consider the xregex path

$$\alpha:= \&x\cdot x\{\Sigma^{*}\cdot y\{\Sigma^{*}\}\}\cdot \&x\cdot \&y\cdot y\{\Sigma^{*}\}\cdot\&x\cdot\&y. $$

The construction from the proof of Lemma 3.19 leads to the equivalent xregex path

$$\gamma:= \varepsilon\cdot x\{\Sigma^{*}\cdot y\{\Sigma^{*}\}\}\cdot \&x\cdot \&y\cdot \hat{y}\{\Sigma^{*}\}\cdot\&x\cdot\&\hat{y}, $$

from which we derive the functional regex formula

$$\delta:= x\left\{\Sigma^{*} y\{\Sigma^{*}\}\right\} x_{1}\{\Sigma^{*}\} y_{1}\{\Sigma^{*}\} \hat{y}\{\Sigma^{*}\} x_{2}\{\Sigma^{*}\} \hat{y}_{1}\{\Sigma^{*}\}, $$

which we use in the spanner representation $\rho := \pi _{\emptyset }\zeta ^=_{x,x_{1},x_{2}}\zeta ^=_{y,y_{1}}\zeta ^=_{\hat {y},\hat {y}_{1}} \delta .$ Then $\mathcal {L}(\alpha )=\mathcal {L}(\rho )$.
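The left-to-right rewriting from the proof of Lemma 3.19 can be sketched as follows. The encoding is an assumption made for illustration: an xregex path is a list of items `("re", s)`, `("ref", x)`, or `("bind", x, subpath)`; a repeated binding of x is named `x^` (then `x^^`, and so on) instead of $\hat {x}$; and the generated names are assumed not to clash with variables already present. The sketch produces δ directly, without materializing γ.

```python
from collections import Counter

def path_to_spanner(path):
    """Turn an xregex path into (delta, selections), both as plain data."""
    current = {}        # xregex variable -> name of its most recent binding
    bound = Counter()   # how often each xregex variable has been bound
    refs = Counter()    # number of references seen for each binding name
    groups = {}         # binding name -> variables that must be equal

    def walk(items):
        out = []
        for item in items:
            if item[0] == "re":
                out.append(item[1])
            elif item[0] == "ref":
                name = current.get(item[1])
                if name is None:
                    continue                    # unbound reference defaults to ε
                refs[name] += 1
                fresh = f"{name}_{refs[name]}"  # i-th reference becomes name_i{Σ*}
                groups.setdefault(name, [name]).append(fresh)
                out.append(fresh + "{Σ*}")
            else:                               # ("bind", x, subpath)
                _, x, sub = item
                name = x + "^" * bound[x]       # x, x^, x^^, ... per rebinding
                bound[x] += 1
                body = walk(sub)                # body is read before x is rebound
                current[x] = name
                out.append(name + "{" + body + "}")
        return "·".join(out)

    delta = walk(path)
    return delta, list(groups.values())

any_word = ("re", "Σ*")
alpha = [("ref", "x"),
         ("bind", "x", [any_word, ("bind", "y", [any_word])]),
         ("ref", "x"), ("ref", "y"),
         ("bind", "y", [any_word]),
         ("ref", "x"), ("ref", "y")]
delta, selections = path_to_spanner(alpha)
```

For the xregex path of Example 3.20, `selections` contains the three groups {x, x_1, x_2}, {y, y_1}, and {y^, y^_1}, which correspond to the three string equality selections of ρ.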

As these spanner representations are Boolean, they are also union compatible. Hence, we can now combine Lemma 3.17 and Lemma 3.19 to observe the following.

Theorem 3.21

There is an algorithm that, given α ∈ v s f X R, computes $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}$ with $\mathcal {L}(\rho )=\mathcal {L}(\alpha )$.

In Section 4.2, we use Theorem 3.21 together with the undecidability results from [12] to obtain multiple lower bounds for static analysis problems. Theorem 3.21 also raises the question whether every language that is recognized by a core spanner is also recognized by a vstar-free xregex. As we have already seen in Example 3.8, it is possible to express the language

$$L_{\text{imp}}:=\{w^{n}\mid w\in\Sigma^{+}, n\geq 2\} $$

with core spanners. Hence, under certain conditions, core spanners can simulate constructions like (&x)^∗.

While L _imp might seem to be an obvious witness that separates the classes of languages that are recognized by core spanners and by vstar-free xregexes, proving this appears to be quite involved. Instead, we consider a related language, which allows us to use the following tool:

Definition 3.22

Let $k\in \mathbb {N}_{>0}$. We call a set $A \subseteq \mathbb {N}^{k}$ linear if there exist an r ≥ 0 and $m_{0},\ldots ,m_{r}\in \mathbb {N}^{k}$ with $A=\{m_{0}+ m_{1} i_{1} + m_{2} i_{2} + {\cdots } + m_{r} i_{r} \mid i_{1},i_{2},\ldots ,i_{r}\in \mathbb {N}\}$ . A set $A \subseteq \mathbb {N}^{k}$ is semi-linear if it is a finite union of linear sets. Assume Σ = {a ₁, a ₂, …, a _k} with |Σ| = k. The Parikh map $\Psi \colon \Sigma ^{*}\to \mathbb {N}^{k}$ is defined by $\Psi (w):= (|w|_{a_{1}},|w|_{a_{2}},\ldots ,|w|_{a_{k}})$, and is extended to languages by Ψ(L) := {Ψ(w)∣w ∈ L}. We call L semi-linear if Ψ(L) is semi-linear.
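As a small illustration of these notions, the following sketch computes Parikh images and enumerates a finite portion of a linear set. The function names and the cut-off parameter `bound` are assumptions made for this example; a linear set with at least one non-zero period vector is infinite, so only finitely many of its elements can be listed.

```python
from itertools import product

def parikh(word, alphabet):
    """Parikh map: number of occurrences of each letter, in alphabet order."""
    return tuple(word.count(a) for a in alphabet)

def linear_set(m0, periods, bound):
    """Elements m0 + i_1*m_1 + ... + i_r*m_r with all coefficients below bound."""
    points = set()
    for coeffs in product(range(bound), repeat=len(periods)):
        point = list(m0)
        for c, m in zip(coeffs, periods):
            point = [p + c * q for p, q in zip(point, m)]
        points.add(tuple(point))
    return points

# The Parikh image of {a^n b^n | n >= 0} is the linear set with m_0 = (0, 0)
# and the single period m_1 = (1, 1); here compared for n < 5.
images = {parikh("a" * n + "b" * n, "ab") for n in range(5)}
portion = linear_set((0, 0), [(1, 1)], 5)
```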

According to Parikh’s Theorem [32], every context-free language is semi-linear. Moreover, as shown by Ginsburg and Spanier [19], a set is semi-linear if and only if it is definable in Presburger arithmetic. Building on this, we state the following.

Theorem 3.23

For every α ∈ v s f X R, the language $\mathcal {L}(\alpha )$ is semi-linear.

Proof

In order to increase the readability, we prove the claim for the case |Σ| = 2 (the adaption to larger alphabets is obvious). We assume Σ = {a, b} and define Ψ(a) := (1, 0) and Ψ(b) := (0, 1). Assume that V a r s (α) = {x ₁, …, x _k} for some $k\in \mathbb {N}_{>0}$.

It suffices to prove the claim for α ∈ X R P, as semi-linear sets are closed under union, and (according to Lemma 3.17) every vstar-free xregex is equivalent to a finite union of xregex paths.

As explained in the proof of Lemma 3.19 (in the construction of γ), we can also assume without loss of generality that every variable binding x{⋯} occurs exactly once in α, and that no variable reference &x _i uses the default binding ε. In particular, this means that in every α-parse tree, each variable x _i stores exactly one word w _i.

Let α be an xregex path that satisfies these conditions. Our goal is to construct a Presburger formula φ such that $\varphi(n^{\mathtt {a}},n^{\mathtt {b}})$ is true if and only if $(n^{\mathtt {a}},n^{\mathtt {b}})\in \Psi (\mathcal {L}(\alpha ))$. This formula will use variables $x^{\mathtt {a}}_{i}$ and $x^{\mathtt {b}}_{i}$ to represent $|w_{i}|_{\mathtt {a}}$ and $|w_{i}|_{\mathtt {b}}$, respectively. Recall that, due to our initial assumptions, each reference &x _i refers to the same word w _i; hence, we can safely define the corresponding variables $x^{\mathtt {a}}_{i}$ and $x^{\mathtt {b}}_{i}$ “globally” in φ.

Let I⊆{1, …, k}. We use x and x _I as abbreviations for the sequences $x^{\mathtt {a}}_{1},x^{\mathtt {b}}_{1}, \ldots $, $x^{\mathtt {a}}_{k},x^{\mathtt {b}}_{k}$ and $\left (x^{\mathtt {a}}_{i},x^{\mathtt {b}}_{i} : i \in I\right )$, and define

$$\varphi(n^{\mathtt{a}},n^{\mathtt{b}}):= \exists \mathbf{x}\colon \varphi_{\alpha}(n^{\mathtt{a}},n^{\mathtt{b}},\mathbf{x}), $$

where φ _α with V a r s(α) = {x ₁, …, x _k} is constructed according to the following general procedure.

Given an xregex path γ, we define a Presburger formula φ _γ as follows: First, as γ is an xregex path, there is a decomposition γ = γ ₁ ⋅ γ ₂⋯γ _l($l\in \mathbb {N}_{>0}$), where each γ _i is either a proper regular expression, a variable reference, or a variable binding of the form $x\{\hat {\gamma _{i}}\}$ such that $\hat {\gamma _{i}}$ is also an xregex path. For each γ _i, we use variables $n^{\mathtt {a}}_{i}$ and $n^{\mathtt {b}}_{i}$ to denote the number of a or b that occur in the subword that is generated by γ _i.

We denote the set of all variables that are bound or referenced in γ _i by

$${\textsf{VarsBR}\left( \gamma_{i}\right)} := {\textsf{Vars}\left( \gamma_{i}\right)} \cup \{x \mid \&x \text{ occurs in }\gamma_{i}\}. $$

In a slight abuse of notation, we identify $\mathbf {x}_{\textsf{VarsBR}(\gamma _{i})}$ with the sequence ($x^{\mathtt{a}}$, $x^{\mathtt{b}}$: x ∈ V a r s B R(γ _i)).

Keeping this in mind, we define

$$\begin{array}{@{}rcl@{}} &&\varphi_{\gamma}\left( n^{\mathtt{a}},n^{\mathtt{b}},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma\right)}}\right):= \exists n^{\mathtt{a}}_{1}, n^{\mathtt{b}}_{1}, {\ldots} n^{\mathtt{a}}_{l}, n^{\mathtt{b}}_{l}\colon\\ &&\,\,\,\,\left( \left( n^{\mathtt{a}} = n^{\mathtt{a}}_{1} + {\cdots} + n^{\mathtt{a}}_{l}\right) \!\wedge\! \left( n^{\mathtt{b}} = n^{\mathtt{b}}_{1} + {\cdots} + n^{\mathtt{b}}_{l}\right) \!\wedge\! \bigwedge\limits_{i=1}^{l} \varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i}, n^{\mathtt{b}}_{i},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma_{i}\right)}}\right)\right), \end{array} $$

where the Presburger formulas are defined as follows:

If γ _i is a proper regular expression, then $\mathcal {L}(\gamma _{i})$ is semi-linear (as a consequence of Parikh’s theorem [32], every regular language is semi-linear). Hence, due to Ginsburg and Spanier [19], there is a Presburger formula $\hat {\varphi }_{\gamma _{i}}$ such that $\hat {\varphi }_{\gamma _{i}}(n^{\mathtt {a}}, n^{\mathtt {b}})$ is true if and only if $(n^{\mathtt {a}}, n^{\mathtt {b}})\in \Psi (\mathcal {L}(\gamma _{i}))$. We define
$$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{\textsf{VarsBR}(\gamma_{i})}\right):= \hat{\varphi}_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i}\right). $$
In order to avoid potential confusion, note that in this case $\mathbf {x}_{\textsf{VarsBR}(\gamma _{i})}$ is the empty sequence. This is due to the fact that γ _i is a proper regular expression, which implies V a r s B R(γ _i) = ∅.
If γ _i = &x _j for some 1 ≤ j ≤ k, we define
$$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma_{i}\right)}}\right):= \left( n^{\mathtt{a}}_{i}=x^{\mathtt{a}}_{j}\right) \wedge \left( n^{\mathtt{b}}_{i}=x^{\mathtt{b}}_{j}\right). $$
If γ _i = x _j{δ} for some 1 ≤ j ≤ k and some xregex path δ, we define
$$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma_{i}\right)}}\right):= \left( n^{\mathtt{a}}_{i}=x^{\mathtt{a}}_{j}\right) \wedge \left( n^{\mathtt{b}}_{i}=x^{\mathtt{b}}_{j}\right) \wedge \varphi_{\delta}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{\textsf{VarsBR}(\delta)}\right). $$

While the definition recurses in the case of xregex paths that contain variable bindings (the third case in the definition of $\varphi _{\gamma _{i}}$ above), the formula φ is still ensured to be finite and well-defined (as δ is always a subexpression of γ and, hence, shorter).

Recall that by our initial assumption, for every variable x _i, each variable reference &x _i refers to the same word w _i. Taking this into account, we can prove that

$$\Psi(\mathcal{L}(\alpha))=\{(n^{\mathtt{a}},n^{\mathtt{b}})\mid \varphi(n^{\mathtt{a}},n^{\mathtt{b}}) \text{ is true}\} $$

via a straightforward structural induction. □

We use Theorem 3.23 to separate the classes of languages that are recognized by core spanners and by vstar-free xregexes:

Lemma 3.24

Let L _nsl := {(ab ^m)ⁿ ∣ m, n ≥ 2} and $\rho := \zeta ^{R_{\text {com}}}_{x,y} (x\{\mathtt {a}\mathtt {b}\mathtt {b}^{+}\}y\{\Sigma ^{+}\})$ for Σ := {a, b}. Then $L_{\text {nsl}}=\mathcal {L}(\rho )$, but there is no α ∈ v s f X R with $\mathcal {L}(\alpha )=L_{\text {nsl}}$.

Proof

Assume that there is an α ∈ v s f X R with $\mathcal {L}(\alpha )=L_{\text {nsl}}$. By Theorem 3.23, L _nsl must be semi-linear. Note that Ψ(L _nsl) = {(n, m n)∣m, n ≥ 2}. As semi-linear sets are closed under projection (cf. Ginsburg and Spanier [19]), this implies that the set C := {m n ∣ m, n ≥ 2} is semi-linear, and due to closure under complementation (also cf. [19]), the set P = {p ∣ p is prime, p = 0, or p = 1} is semi-linear as well. However, semi-linear sets are finite unions of linear sets, and so P contains a subset $P_{c,a} := \{ c+an \mid n \in \mathbb {N}_{>0} \}$ of prime numbers for c ≥ 2 and a ≥ 2. Obviously, c + a c = c(1 + a) ∈ P _{c, a}, but c(1 + a) is a composite number. Hence, there is no α ∈ v s f X R with $\mathcal {L}(\alpha ) = L_{\text {nsl}}$. □
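The two number-theoretic facts used in this argument can be checked numerically on an initial segment of the natural numbers. This is only a sanity check under an arbitrary bound N, not a replacement for the proof; the names below are chosen for this sketch.

```python
def is_prime(p):
    return p >= 2 and all(p % d for d in range(2, int(p ** 0.5) + 1))

N = 200
# C restricted to {0, ..., N-1}: all products m*n with m, n >= 2.
C = {m * n for m in range(2, N) for n in range(2, N) if m * n < N}
P = set(range(N)) - C   # complement of C within {0, ..., N-1}

# The complement consists of 0, 1, and the primes below N.
complement_ok = P == {p for p in range(N) if p < 2 or is_prime(p)}

def composite_in_progression(c, a):
    """The element c + a*c = c*(1 + a) of {c + a*n | n > 0}, taken at n = c."""
    return c + a * c
```

For every c ≥ 2 and a ≥ 1, the returned value lies in the progression and has the proper divisor c, so no such progression consists of primes only.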

We do not need the join operator to define non-semi-linear languages: Consider the core spanner representation ρ from Example 3.14 with $\mathcal {L}(\rho )=L_{\text {nsl}}$. If we construct $\hat {\rho }$ as explained below that example, we obtain $\mathcal {L}(\hat {\rho })=\{ww\mid w\in L_{\text {nsl}}\}$, which is also not semi-linear.

It is worth pointing out that Lemma 3.24 does not resolve the open question from [7] whether there is a language that is recognized by a core spanner, but not by an xregex, as Theorem 3.23 only applies to vstar-free xregexes. We have already seen languages that are not semi-linear, but are recognized by xregexes: The language L _nsl is recognized by α _nsl := x{abb ⁺} ⋅ (&x)⁺; and a similar approach is used for the following language (which we already met in Example 2.4):

Example 3.25

Let Σ := {a}, and define the language L _npr := {a ^{mn} ∣ m, n ≥ 2}. In other words, L _npr is the language of all words a ⁱ with i ≥ 4 such that i is not a prime number. Let α _npr := x{aa ⁺} ⋅ (&x)⁺. Then $\mathcal {L}(\alpha _{\text {npr}})=L_{\text {npr}}$.
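The xregex α _npr has a direct counterpart among the backreference regexes of practical regex engines. The sketch below uses Python's `re` module as an approximation of xregex semantics (an assumption; the two formalisms do not agree in all cases, for instance on references to unbound variables) and compares the matched lengths with the composite numbers on a small range.

```python
import re

# x{aa+}·(&x)+ with the binding for x expressed as capture group 1
alpha_npr = re.compile(r"(aa+)\1+")

def in_L_npr(i):
    """True if a^i is matched in full by the backreference regex."""
    return alpha_npr.fullmatch("a" * i) is not None

def is_composite(i):
    return i >= 4 and any(i % d == 0 for d in range(2, i))

agreement = all(in_L_npr(i) == is_composite(i) for i in range(40))
```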

While L _nsl and L _npr are defined by very similar xregexes, the latter cannot be recognized by core spanners. In order to show this with a semi-linearity argument, we observe:

Theorem 3.26

Let |Σ| = 1 and let P be a core spanner over Σ. Then $\mathcal {L}(P)$ is semi-linear.

Proof

We prove this by showing that on unary terminal alphabets, every EC ^reg-language is semi-linear. Due to Theorem 3.12, this proves the claim.

Let Σ = {a}, and consider any EC ^reg-formula φ(w) over Σ. We show that $\mathcal {L}(\varphi )$ is semi-linear by converting φ into a Presburger formula $\hat {\varphi }$ for the set $\Psi (\mathcal {L}(\varphi ))=\{|w|\mid w\in \mathcal {L}(\varphi )\}$. We obtain $\hat {\varphi }$ by rewriting φ in the following way:

1.
Each quantifier ∃x is replaced with $\exists \hat {x}$.
2.
Each regular constraint L _A(x) is replaced with a formula $\hat {\varphi }_{A}(\hat {x})$ for the set $\{|x|\mid x \in \mathcal {L}(A)\}$. As each $\mathcal {L}(A)$ is a regular language, this is possible according to Ginsburg and Spanier [19].
3.
Each word equation η _L = η _R is replaced with the equation sum(η _L) = sum(η _R), where the function sum is defined by sum(a) := 1, $\text {sum}(x):= \hat {x}$ for x ∈ X, and sum(α ⋅ β) := sum(α) + sum(β).

For example, the word equation x a x y x = a y z z y a is converted into the Presburger equation $\hat {x} + 1 + \hat {x} + \hat {y}+ \hat {x} = 1 + \hat {y}+ \hat {z}+ \hat {z}+ \hat {y} + 1$ (for Σ = {a}). Intuitively, each variable $\hat {x}$ in $\hat {\varphi }$ contains the length of x in φ (which, as |Σ| = 1, corresponds to the Parikh image of that word). Hence, the Presburger formula $\hat {\varphi }$ defines the set $\Psi (\mathcal {L}(\varphi ))$. According to [19], this implies that $\Psi (\mathcal {L}(\varphi ))$ is semi-linear, which means that $\mathcal {L}(\varphi )$ is semi-linear. This concludes the proof. □
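The function sum from step 3 can be sketched in a few lines. The representation is an assumption made for this illustration: each side of the equation is a string of one-letter symbols, variables are given as a set, and the linear form is stored as a `Counter` whose key `None` holds the constant term.

```python
from collections import Counter

def sum_side(side, variables):
    """Linear form of one side: coefficient per variable, constant under None."""
    form = Counter()
    for sym in side:
        # a variable contributes its length variable, a terminal contributes 1
        form[sym if sym in variables else None] += 1
    return form

def length_equation_holds(eta_l, eta_r, variables, lengths):
    """Check sum(eta_L) = sum(eta_R) under an assignment of lengths."""
    def value(form):
        return sum(c * (1 if k is None else lengths[k]) for k, c in form.items())
    return value(sum_side(eta_l, variables)) == value(sum_side(eta_r, variables))

variables = {"x", "y", "z"}
left = sum_side("xaxyx", variables)     # 3*x + y + 1
right = sum_side("ayzzya", variables)   # 2*y + 2*z + 2
ok = length_equation_holds("xaxyx", "ayzzya", variables,
                           {"x": 1, "y": 0, "z": 1})
```

The assignment in the last line corresponds to x = a, y = ε, and z = a, for which both sides of the word equation evaluate to aaaa.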

Note that this construction only applies to unary alphabets, as this is the only case where there is a one-to-one correspondence between words and their Parikh images.
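The function sum from step 3 of the proof can be sketched in a few lines of Python. The representation is an assumption made for illustration: each side of a word equation is given as a whitespace-separated string, the single terminal is the symbol a, and the constant term of the resulting linear expression is stored under the key "1". The names length_sum and holds are hypothetical.

```python
from collections import Counter

def length_sum(side: str, terminal: str = "a") -> Counter:
    """Map one side of a word equation to a linear expression over lengths.

    Each terminal adds 1 to the constant term (key "1"); each occurrence of
    a variable x adds one occurrence of its length variable.
    """
    expr = Counter()
    for symbol in side.split():
        expr["1" if symbol == terminal else symbol] += 1
    return expr

def holds(lhs: Counter, rhs: Counter, lengths: dict) -> bool:
    """Check whether the length equation lhs = rhs holds for the given lengths."""
    def value(expr):
        return sum(c * (1 if s == "1" else lengths[s]) for s, c in expr.items())
    return value(lhs) == value(rhs)

# The example from the proof: x a x y x = a y z z y a.
lhs = length_sum("x a x y x")
rhs = length_sum("a y z z y a")
```

For the example equation, this yields 3x̂ + ŷ + 1 on the left and 2ŷ + 2ẑ + 2 on the right. The assignment x = a, y = ε, z = a solves the word equation, and the corresponding lengths (1, 0, 1) solve the length equation.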

Apart from the observation that L _npr from Example 3.25 is not recognized by core spanners, Theorem 3.26 also allows us to conclude the following.

Corollary 3.27

If |Σ| = 1, then $\mathcal {L}(P)$ is regular for every core spanner P.

In other words, for unary terminal alphabets, core spanners recognize exactly the same class as regular spanners, namely the class of regular languages (which, in the unary case, is identical to the class of context-free languages). Furthermore, Lemma 3.24 and Theorem 3.26 together show the following.

Corollary 3.28

The class of languages that is recognized by core spanners is not closed under homomorphisms.

We conclude this section with a summary of our insights into the relative expressive power of the various models. To increase readability, we use the following definitions: Let REG, XR, and PAT denote the classes of regular expressions, xregexes, and patterns, respectively. For a class of language recognizing mechanisms $\mathcal {D}$, let $\mathcal {L}(\mathcal {D})$ denote the class of languages that are recognized by elements of $\mathcal {D}$. For example, $\mathcal {L}(\textsf{PAT})$ is the class of pattern languages, and $\mathcal {L}(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}})$ is the class of languages that are recognized by core spanners. The hierarchy in Fig. 2 is obtained by combining the results in the present section with the fact that every pattern language contains either exactly one or infinitely many words (first observed by Angluin [1]), and that there are regular languages that are not EC-recognizable (see Karhumäki et al. [26]). Two sets of questions remain open: Firstly, although Theorem 3.26 together with Example 3.25 shows that there is a language that is recognized by xregex, but not by EC ^reg (and, hence, also not by EC or $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$), it remains open whether the reverse direction holds as well. Secondly, although we know that $\mathcal {L}(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}})\subseteq \mathcal {L}({\textsf{EC}^{\textsf{reg}}})$, we do not know whether this inclusion is strict. In fact, it even remains open whether there is a language that is recognized by EC, but not by $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$. This second set of questions is discussed in more detail in Freydenberger [13].

4 Decision Problems

4.1 Spanner Evaluation

We first examine the combined complexity of the evaluation problem for core spanners. To this end, we define the problem CSp−Eval: Given a core spanner representation $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$, a word w ∈ Σ ^∗, and a (SVars(ρ), w)-tuple μ, is μ ∈ [[ρ]](w)? In order to prove lower bounds for this problem, we consider the membership problem for pattern languages: Given a pattern α and a word w, decide whether $w\in \mathcal {L}(\alpha )$. As shown by Jiang et al. [24], this problem is NP-complete (for pattern languages that do not allow replacing variables with ε, this was already shown by Angluin [1]). Due to Theorem 3.3, we observe the following (the proof of NP-membership is straightforward).

Theorem 4.1

CSp−Eval is NP -complete, even if restricted to $\textsf{RGX}^{\{\pi ,\zeta ^=\}}$.

Proof

In order to prove NP-hardness, it suffices to give a polynomial time reduction from the membership problem for pattern languages to CSp−Eval. Given a pattern α and a word w, we use Theorem 3.3 to construct a spanner representation $\rho _{\alpha }\in \textsf{RGX}^{\{\zeta ^=\}}$ in polynomial time such that $\mathcal {L}(\alpha )=\mathcal {L}(\rho _{\alpha })$. Next, we define ρ := π _∅ ρ _α. As ρ represents a Boolean spanner, we define μ to be the empty tuple (). Now, μ ∈ [[ρ]](w) holds if and only if $w\in \mathcal {L}(\alpha )$.

We prove membership in NP using the following NP-algorithm: Assume that we are given a core spanner representation ρ, a word w ∈ Σ ^∗, and a w-tuple μ. For every regex formula γ in ρ, we nondeterministically guess a w-tuple μ _γ. By definition, each of these tuples has a size that is polynomial in |w|. In addition to this, for every union (ρ ₁ ∪ ρ ₂), we guess a representation ρ _i that is ignored. We then verify these guesses deterministically: First, we discard all parts of ρ that are ignored, and obtain a spanner representation $\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\bowtie \}}$. For all remaining regex formulas γ in $\hat {\rho }$, we check whether μ _γ is consistent with γ and w. Obviously, this can be done in polynomial time. If all of these checks pass, we evaluate all operators in $\hat {\rho }$. As $\hat {\rho }$ contains no unions, the result of these evaluations is always either ∅, or a set that contains exactly one w-tuple. Hence, this process only takes polynomial time. Furthermore, when it terminates, it results either in ∅, or in a w-tuple $\hat {\mu }$. In the latter case, we return True if $\hat {\mu }=\mu $. □

The question arises whether there are natural restrictions to CSp−Eval that make this problem tractable. It appears that any subclass of the core spanners that extends regular spanners in a meaningful way while having a tractable evaluation problem cannot be allowed to recognize the full class of pattern languages.

For pattern languages, it was shown by Ibarra et al. [23] that bounding the number of variables in the pattern leads to an algorithm for the membership problem with a running time that is polynomial, although in $\mathcal {O}(n^{k})$ (where n is the length of the word w, and k the number of variables). From a parameterized complexity point of view (see e. g. Grohe and Flum [20]), this is usually not considered satisfactory. Without going too much into details, in parameterized complexity, one generally considers parameterized problems tractable that belong to the class FPT (from fixed-parameter tractable). This class is defined as follows: The input of a parameterized problem is a pair (x, k), where x is the input of the non-parameterized problem (e. g., a pattern α and a word w), and k is a parameter of the input (e. g., the number of variables in α). The parameterized problem is in FPT if there exist a computable function f, a constant c ≥ 0, and an algorithm that decides the problem in time O(f(k)n ^c). We do not define the class W[1], but we note that the standard complexity theoretic assumption is that if a problem is W[1]-hard, it is not in FPT.
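To illustrate where a running time of the form n ^k comes from, the following is a minimal backtracking sketch for the membership problem for (erasing) pattern languages. It is not the algorithm of Ibarra et al. [23]; it is a naive procedure written for this illustration, and the representation (a pattern as a list of tokens plus an explicit set of variable names) is an assumption. Each variable is guessed once, at its first occurrence, among at most n + 1 prefixes of the remaining word, so the search explores on the order of n ^k branches for k variables.

```python
def matches(pattern, word, variables, assignment=None):
    """Decide whether `word` belongs to the erasing pattern language of `pattern`.

    `pattern` is a list of tokens; tokens in `variables` are variables, all
    other tokens are terminal strings. Variables may be replaced with the
    empty word, and every occurrence of a variable gets the same image.
    """
    assignment = {} if assignment is None else assignment
    if not pattern:
        return word == ""
    head, rest = pattern[0], pattern[1:]
    if head not in variables:
        # Terminal: it must be a prefix of the remaining word.
        return word.startswith(head) and matches(
            rest, word[len(head):], variables, assignment)
    if head in assignment:
        # Bound variable: reuse the image chosen at its first occurrence.
        image = assignment[head]
        return word.startswith(image) and matches(
            rest, word[len(image):], variables, assignment)
    # Unbound variable: try every prefix of the remaining word as its image.
    for cut in range(len(word) + 1):
        assignment[head] = word[:cut]
        if matches(rest, word[cut:], variables, assignment):
            return True
    del assignment[head]
    return False
```

For example, matches(["x", "a", "x"], "bab", {"x"}) succeeds with x replaced by b, while the pattern x x rejects the word aba.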

It was first observed by Stephan et al. [34] that the membership problem for pattern languages is W[1]-complete if the number of variable occurrences (not of variables) is used as a parameter (see Fernau et al. [11] for the full proof). As the number of variable occurrences in a pattern corresponds to the number of variables in an equivalent spanner, this implies that using the number of variables in a spanner as parameter leads to W[1]-hardness for this parameter of CSp−Eval.

Fernau and Schmid [10] and Fernau et al. [11] discuss these and various other potential restrictions to pattern languages that still do not lead to tractability (among these a bound on the length of the replacement of each variable, which corresponds to a bound on the length of spans). On the other hand, Reidenbach and Schmid [33] and Fernau et al. [9] examine parameters for patterns that make the membership problem tractable. While this does not directly translate to spanners, the authors consider these directions promising for further research.

But apart from these potential restrictions on the use of string equality, other restrictions are needed, as the use of join also makes evaluation intractable:

Proposition 4.2

CSp−Eval is NP -complete, even if restricted to RGX ^{π,⋈}.

Proof

We prove this with a reduction from the Clique problem: Given an undirected graph G = (V, E) and a number k ≤ |V|, decide whether G contains a clique of size k. This problem is NP-complete (cf. Garey and Johnson [18]). Consider an undirected graph G = (V, E) with V = {1, …, n} for some n ≥ 1, and a number k ≤ n. Let a ∈ Σ and define w := a ⁿ and ρ := ⋈_1≤i<j≤k α _i,j, where each α _i,j is defined by

$$\alpha_{i,j}:= \underset{\underset{u<v}{\{u,v\}\in E,}}{\bigvee}\:\mathtt{a}^{u-1}\: x_{i}\{\mathtt{a}\}\: \mathtt{a}^{v-u-1}\: x_{j}\{\mathtt{a}\}\: \mathtt{a}^{n-v}. $$

In other words, each part of the disjunction corresponds to a choice of u and v, which allows [[α _i,j]](w) to map x _i to the u-th and x _j to the v-th letter of w. Then μ ∈ [[ρ]](w) holds if and only if there exist distinct nodes v ₁, …, v _k ∈ V such that {v _i, v _j} ∈ E for all 1 ≤ i < j ≤ k; and μ(x _i) = [v _i, v _i + 1〉 for 1 ≤ i ≤ k. Thus, the empty tuple is an element of [[π _∅ ρ]](w) if and only if G contains a clique of size k. □
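The reduction can be made concrete with a small sketch that evaluates the join directly. The encoding is an assumption made for illustration: a w-tuple is a Python dict that maps the index i of a variable x _i to a span (start, end), and instead of matching the regex formula α _i,j against a ⁿ, the sketch lists the tuples that the formula produces (one per edge, so n itself is not needed). The function names are hypothetical. The intermediate relations can grow exponentially, so this is not an efficient clique test.

```python
from itertools import combinations

def alpha_relation(i, j, edges):
    """Tuples of [[alpha_{i,j}]](a^n): one per edge {u, v} with u < v.

    Each tuple maps x_i to the span [u, u+1> and x_j to the span [v, v+1>.
    """
    return [{i: (u, u + 1), j: (v, v + 1)}
            for (u, v) in (sorted(edge) for edge in edges)]

def natural_join(left, right):
    """Keep all combinations of tuples that agree on their shared variables."""
    joined = []
    for s in left:
        for t in right:
            if all(s[x] == t[x] for x in s.keys() & t.keys()):
                joined.append({**s, **t})
    return joined

def has_clique(edges, k):
    """Evaluate the join of all alpha_{i,j}, 1 <= i < j <= k; True iff nonempty."""
    result = [{}]  # the join of zero relations contains only the empty tuple
    for i, j in combinations(range(1, k + 1), 2):
        result = natural_join(result, alpha_relation(i, j, edges))
    return bool(result)
```

On the graph with edges {1,2}, {2,3}, {1,3}, {3,4}, the join for k = 3 is nonempty (nodes 1, 2, 3 form a triangle), while the join for k = 4 is empty.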

We also consider the data complexity of the evaluation problem for core spanners. For every core spanner representation ρ over Σ, we define the decision problem CSp−Eval(ρ): Given a word w ∈ Σ ^∗ and a w-tuple μ, is μ ∈ [[ρ]](w)? Using a slight variation of the proof of Theorem 4.1, we obtain the following.

Theorem 4.3

CSp−Eval (ρ) is in NLOGSPACE for every $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$.

Proof

This result follows from a slight change to the NP-decision procedure from the proof of Theorem 4.1: We can represent the guessed w-tuples μ _γ for each regex formula γ by using two pointers for each μ _γ(x) = [i, j〉 (one pointer for i, one for j). As ρ is fixed, a finite number of such pointers suffices to represent all w-tuples. Furthermore, the verification of these guesses can also be realized nondeterministically with only a constant amount of additional pointers. □

4.2 Static Analysis

We consider the following common decision problems for core spanner representations, where the input is $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ or $\rho _{1},\rho _{2}\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$:

1.
CSp−Sat: Is [[ρ]](w) ≠ ∅ for some w ∈ Σ ^∗?
2.
CSp−Hierarchicality: Is [[ρ]] hierarchical?
3.
CSp−Universality: Is [[ρ]] = Υ_SVars(ρ)?
4.
CSp−Equivalence: Is [[ρ ₁]] = [[ρ ₂]]?
5.
CSp−Containment: Is [[ρ ₁]] ⊆ [[ρ ₂]]?
6.
CSp−Regularity: Is [[ρ]] ∈ [[RGX ^{{π, ∪, ⋈}}]]?

We approach the first two of these problems by using Theorem 3.12 to convert core spanner representations to EC ^reg-formulas, for which satisfiability is in PSPACE (cf. Diekert [6]). Hence, we observe:

Theorem 4.4

The problem CSp−Sat is PSPACE -complete, even if it is restricted to spanner representations from $\textsf{RGX}^{\{\zeta ^=\}}$.

Proof

We begin with the upper bound. According to Theorem 3.12, for every core spanner representation ρ, there exists an EC ^reg-formula φ that realizes [[ρ]]. Furthermore, φ can be computed in polynomial time. In particular, φ is satisfiable if and only if ρ is satisfiable. As satisfiability for EC ^reg-formulas is in PSPACE (cf. Diekert [6]), this question can be answered in PSPACE.

For the lower bound, we construct a reduction to CSp−Sat from the intersection emptiness problem for regular expressions, which is defined as follows: Given (proper) regular expressions α ₁, …, α _n, decide whether $\bigcap _{i=1}^{n}\mathcal {L}(\alpha _{i})=\emptyset $. As a direct consequence of the proof of Lemma 3.2.3 in Kozen [27], this problem is PSPACE-complete (although Kozen’s proof uses automata, these are defined via regular expressions). Recall that every proper regular expression is also a functional regex formula. Hence, we can construct a Boolean spanner representation

$$\rho:= \zeta^=_{x_{1},\ldots,x_{n}}x_{1}\{\alpha_{1}\}{\cdots} x_{n}\{\alpha_{n}\}. $$

Obviously, for every w ∈ Σ ^∗, we have [[ρ]](w) ≠ ∅ if and only if there exists a word v ∈ Σ ^∗ with w = v ⁿ and $v\in \mathcal {L}(\alpha _{i})$ for 1 ≤ i ≤ n. Hence, ρ is satisfiable if and only if $\bigcap _{i=1}^{n}\mathcal {L}(\alpha _{i})\neq \emptyset $. As PSPACE is closed under complementation, this proves PSPACE-hardness of CSp−Sat, even when restricted to representations from the class $\textsf{RGX}^{\{\zeta ^=\}}$. □
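The behaviour of the spanner from this reduction on a single input word can be sketched as follows. The sketch assumes that the proper regular expressions α _i are written in the syntax of Python's re module, and the function name rho_nonempty is hypothetical. Because the string equality selection forces all variables to hold the same word, the input must split into n blocks of equal length.

```python
import re

def rho_nonempty(word, alphas):
    """Check whether the spanner zeta^=_{x1..xn} x1{a1}...xn{an} accepts `word`."""
    n = len(alphas)
    if n == 0 or len(word) % n != 0:
        return False
    step = len(word) // n
    blocks = [word[k * step:(k + 1) * step] for k in range(n)]
    # String equality selection: all blocks must be the same word v.
    if any(block != blocks[0] for block in blocks):
        return False
    # The content of each x_i must be matched by alpha_i.
    return all(re.fullmatch(alpha, blocks[0]) is not None for alpha in alphas)
```

For the expressions a*b, (a|b)* and ab+, the word ab lies in all three languages, so the spanner accepts ababab. Deciding satisfiability still requires a search over all candidate words v, which is where the hardness lies.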

The proof of the lower bound in Theorem 4.4 uses the PSPACE-hardness of the intersection emptiness problem for regular expressions. But even if the variables in the regex formulas were only bound to Σ ^∗, it follows from Theorem 3.13 that this problem would still be at least as hard as the satisfiability problem for word equations without constraints. Considering that even proving the decidability was hard (see Diekert [6] for an overview), approaching CSp−Sat without knowledge on word equations would have required enormous additional effort.

It is also possible to use EC ^reg-formulas to express a violation of the criteria for hierarchicality. This allows us to state the following result:

Theorem 4.5

The problem CSp−Hierarchicality is PSPACE -complete, even if it is restricted to $\textsf{RGX}^{\{\zeta ^=,\times \}}$.

Proof

We begin with the upper bound. The main idea is that non-hierarchicality can be expressed in EC ^reg-formulas. Hence, our goal is to construct a polynomial time procedure that, given a core spanner representation $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$, constructs an EC ^reg-formula φ _NH that is satisfiable if and only if [[ρ]] is not hierarchical.

Recall that, by definition, for every spanner P and every word w ∈ Σ ^∗, a w-tuple μ ∈ P(w) is not hierarchical if there exist variables x, y ∈ SVars(P) such that all of the following hold:

1.
The span μ(x) does not contain μ(y),
2.
the span μ(y) does not contain μ(x), and
3.
the spans μ(x) and μ(y) overlap (i. e., they are not disjoint).

If this is the case, we say that μ(x) and μ(y) strictly overlap. It is easy to see that two spans [i ₁, j ₁〉 and [i ₂, j ₂〉 strictly overlap if one of the following strict overlap conditions is met:

1.
i ₁ < i ₂ < j ₁ < j ₂,
2.
i ₂ < i ₁ < j ₂ < j ₁.

For an illustration of these two conditions, see Fig. 3. Our next goal is to define an EC ^reg-formula φ _ovl(x ^P, x ^C, y ^P, y ^C) that expresses the first condition when combined with an EC ^reg-formula that realizes a spanner (we do not need to define a formula for the second condition, as both conditions are symmetrical). To this purpose, we first define the EC ^reg-formula

$$\varphi_{\text{ppref}}(x,y):= \exists z: (L_{A}(z)\wedge (xz = y)),$$

where A is an NFA with $\mathcal {L}(A)=\Sigma ^{+}$. Clearly, (x, y) ∈ Σ ^∗ × Σ ^∗ satisfies φ _ppref if and only if x is a proper prefix of y. Next, we define

$$\begin{array}{@{}rcl@{}} &&\varphi_{\text{ovl}}(x^{P},x^{C},y^{P},y^{C}):=\\ &&\qquad\qquad\qquad\qquad\exists z_{1}, z_{2}:\ ((z_{1} = x^{P}x^{C})\wedge (z_{2} = y^{P} y^{C})\\ &&\qquad\qquad\qquad\qquad\qquad\qquad\wedge \varphi_{\text{ppref}}(x^{P},y^{P}) \wedge \varphi_{\text{ppref}}(y^{P},z_{1}) \wedge \varphi_{\text{ppref}}(z_{1},z_{2})). \end{array} $$

The idea behind the construction is as follows: Recall that this formula is going to be used together with an EC ^reg-formula that realizes a spanner. Hence, x ^P and x ^C represent a span [1 + |x ^P|, 1 + |x ^P x ^C|〉 = [i ₁, j ₁〉, while y ^P and y ^C represent a span [1 + |y ^P|, 1 + |y ^P y ^C|〉 = [i ₂, j ₂〉. In particular, x ^P x ^C and y ^P y ^C are both prefixes of some common word w. Hence, i ₁ < i ₂ holds if and only if x ^P is a proper prefix of y ^P. Likewise, i ₂ < j ₁ and j ₁ < j ₂ hold if and only if y ^P is a proper prefix of x ^P x ^C, or x ^P x ^C is a proper prefix of y ^P y ^C, respectively.

In other words, φ _ovl checks whether the first of the two strict overlap conditions is satisfied.
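The correspondence between the index-based condition i ₁ < i ₂ < j ₁ < j ₂ and the prefix-based formula φ _ovl can be checked on concrete words with a short sketch. The encoding of a span as a pair (prefix, content) follows the construction above; the Python function names and the dict representation of a w-tuple are assumptions made for illustration.

```python
def strictly_overlap_first(span_x, span_y):
    """First strict overlap condition: i1 < i2 < j1 < j2."""
    (i1, j1), (i2, j2) = span_x, span_y
    return i1 < i2 < j1 < j2

def is_proper_prefix(x, y):
    """phi_ppref(x, y): there is a nonempty word z with xz = y."""
    return len(x) < len(y) and y.startswith(x)

def phi_ovl(x_p, x_c, y_p, y_c):
    """phi_ovl on the (prefix, content) encodings of two spans of one word."""
    z1, z2 = x_p + x_c, y_p + y_c
    return (is_proper_prefix(x_p, y_p)
            and is_proper_prefix(y_p, z1)
            and is_proper_prefix(z1, z2))

def encode(word, span):
    """Encode the span [i, j> of `word` (1-based) as its prefix and its content."""
    i, j = span
    return word[:i - 1], word[i - 1:j - 1]

def is_hierarchical(mu):
    """A tuple is hierarchical if no ordered pair of its spans strictly overlaps."""
    return not any(strictly_overlap_first(s, t)
                   for s in mu.values() for t in mu.values())
```

Checking all ordered pairs of spans in is_hierarchical covers both strict overlap conditions, since the second condition is the first one with the roles of the two spans exchanged.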

We are now ready to construct φ _NH. Let $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$, and assume that SVars(ρ) = {x ₁, …, x _n} for some n ≥ 2 (spanners with fewer than two variables are trivially hierarchical). Using Theorem 3.12, we then construct an EC ^reg-formula $\varphi _{\rho }(x_{w}, {x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}})$ that realizes [[ρ]]. We now define

$$\begin{array}{@{}rcl@{}} &&\varphi_{\text{NH}} := \exists x_{w}, {x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}: \\ &&\qquad\qquad\qquad\quad\left( \!\varphi_{\rho}\!\left( \!x_{w}, {x^{P}_{1}}, {x^{C}_{1}}, \ldots, {x^{P}_{n}}, {x^{C}_{n}}\right) \!\wedge\! \underset{\underset{i\neq j}{1\leq i,j\leq n;}}{\bigvee}\varphi_{\text{ovl}}\!\left( \!{x^{P}_{i}},{x^{C}_{i}},{x^{P}_{j}},{x^{C}_{j}}\!\right)\!\right)\!. \end{array} $$

Assume that [[ρ]] is not hierarchical. Then there exist a word w ∈ Σ ^∗, a w-tuple μ ∈ [[ρ]](w), and x _l, x _m ∈ SVars(ρ) such that μ(x _l) and μ(x _m) strictly overlap. As φ _ρ realizes [[ρ]], we have that μ defines an assignment $(w, w_{[1,i_{1}\rangle }, w_{[i_{1},j_{1}\rangle }, \ldots , w_{[1,i_{n}\rangle }, w_{[i_{n},j_{n}\rangle })$ that satisfies this subformula (where [i _k, j _k〉 = μ(x _k)). Furthermore, as μ(x _m) and μ(x _l) strictly overlap, either $\varphi _{\text {ovl}}({x^{P}_{l}},{x^{C}_{l}},{x^{P}_{m}},{x^{C}_{m}})$ or $\varphi _{\text {ovl}}({x^{P}_{m}},{x^{C}_{m}},{x^{P}_{l}},{x^{C}_{l}})$ is satisfied (if i _l < i _m or i _m < i _l, respectively). Hence, φ _NH is satisfiable.

Likewise, φ _NH is only satisfied if φ _ρ and (at least) one $\varphi _{\text {ovl}}({x^{P}_{l}},{x^{C}_{l}},{x^{P}_{m}},{x^{C}_{m}})$ are satisfied. This corresponds to a w-tuple μ where μ(x _l) and μ(x _m) strictly overlap. Hence, μ is not hierarchical, which means that [[ρ]] is not hierarchical.

Therefore, φ _NH is satisfiable if and only if [[ρ]] is not hierarchical. Furthermore, φ _NH can be constructed in polynomial time, as we only need to construct φ _ρ (which is possible in polynomial time, according to the proof of Theorem 4.4), and a number of φ _ovl-formulas that is quadratic in |SVars(ρ)|, each of which has a constant length. Both constructions rely solely on the syntax of ρ, and require no further computation.

As satisfiability of EC ^reg-formulas can be decided in PSPACE, the complement of CSp−Hierarchicality is in PSPACE; and as PSPACE is closed under complementation, this means that CSp−Hierarchicality is in PSPACE.

For the lower bound, we slightly modify the proof of the lower bound for CSp−Sat. Again, we use the intersection emptiness problem for regular expressions. Given proper regular expressions α ₁, …, α _n, we define

$$\rho:= \zeta^=_{x_{1},\ldots,x_{n}}(x_{1}\{\mathtt{a}\mathtt{a}\mathtt{a}\cdot\alpha_{1}\}{\cdots} x_{n}\{\mathtt{a}\mathtt{a}\mathtt{a}\cdot\alpha_{n}\}) \times (y\{\Sigma\cdot\Sigma^{+}\}\cdot\Sigma)\times (\Sigma\cdot z\{\Sigma^{+}\cdot\Sigma\}), $$

for some a ∈ Σ. By replacing each α _i in that proof with aaa ⋅ α _i, we ensure that every word w ∈ Σ ^∗ with $[{\kern -2.3pt}[{\zeta ^=_{x_{1},\ldots ,x_{n}}(x_{1}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{1}\}{\cdots } x_{n}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{n}\})}]{\kern -2.3pt}](w)\neq \emptyset $ has length at least 3 (which is the minimal word length for which non-hierarchical spanners are possible). Furthermore, for each such w, the variable y is assigned the span that contains all positions of w except the last one, and z is assigned the span that contains all positions except the first one. Hence, these spans strictly overlap, which means that ρ is not hierarchical. On the other hand, if $[{\kern -2.3pt}[ \zeta ^=_{x_{1},\ldots ,x_{n}}(x_{1}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{1}\}\cdots x_{n}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{n}\}) ]{\kern -2.3pt}](w)=\emptyset $, then [[ρ]] = ∅. Therefore, ρ is hierarchical if and only if there is no $w\in \bigcap _{1\leq i\leq n}\mathcal {L}(\alpha _{i})$. As this problem is PSPACE-complete, CSp−Hierarchicality is PSPACE-hard. □

For the remaining problems, we use Theorem 3.21, and the fact that the undecidability results from Freydenberger [12] also hold for vstar-free xregexes:

Theorem 4.6

The problems CSp−Universality and CSp−Equivalence are not semi-decidable, but co-semi-decidable. The problem CSp−Regularity is neither semi-decidable, nor co-semi-decidable. These results hold even if the input is restricted to $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}$.

Proof

The co-semi-decidability of the first two problems is obvious. We discuss this for universality: For any core spanner representation ρ, we can always decide whether [[ρ]](w) = Υ_SVars(ρ)(w) holds. Hence, we can semi-decide non-universality by enumerating all w ∈ Σ ^∗ until we find a word w with [[ρ]](w) ≠ Υ_SVars(ρ)(w). Thus, CSp−Universality is co-semi-decidable. The proof for CSp−Equivalence works analogously.
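The enumeration argument can be sketched for Boolean spanners as follows. The sketch assumes that the spanner is given as a hypothetical Python predicate accepts(w), which returns True if and only if [[ρ]](w) contains the empty tuple; for Boolean spanners, universality means that this holds for every w. The procedure returns a witness if the spanner is not universal and, as expected for a semi-decision procedure, does not terminate otherwise.

```python
from itertools import count, product

def words(alphabet):
    """Enumerate all words over `alphabet`, shortest first."""
    for length in count(0):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)

def semi_decide_non_universality(accepts, alphabet):
    """Return a word w with accepts(w) == False; run forever if none exists."""
    for w in words(alphabet):
        if not accepts(w):
            return w
```

For instance, for a predicate that rejects exactly the words containing the factor ab, the search over the alphabet {a, b} stops at the word ab.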

We now proceed to the proofs of the lower bounds. As shown by Freydenberger [12], if |Σ| ≥ 2, for xregexes α, the following holds:

It is not semi-decidable whether $\mathcal {L}(\alpha )=\Sigma ^{*}$,
It is neither semi-decidable, nor co-semi-decidable whether $\mathcal {L}(\alpha )$ is a regular language.

The proof in [12] takes a Turing machine $\mathcal {X}$ (with some additional technical restrictions) and computes an xregex $\alpha _{\mathcal {X}}$ with a single variable x such that $\mathcal {L}(\alpha _{\mathcal {X}})=\Sigma ^{*}$ if and only if $\mathcal {X}$ accepts no input, and $\mathcal {L}(\alpha _{\mathcal {X}})$ is regular if and only if $\mathcal {X}$ accepts only finitely many inputs.

These xregexes $\alpha _{\mathcal {X}}$ are defined over the alphabet Σ = {0, #} and, when adapted to the notation of this paper, are always of the following shape:

$$\alpha_{\mathcal{X}}=\alpha_{struc}\mathbin{\vee}\alpha_{state}\mathbin{\vee}\alpha_{head}\mathbin{\vee}\alpha_{mod}\mathbin{\vee}\alpha_{var}. $$

It is important to note that all subexpressions except α _var are proper regular expressions, while

$$\alpha_{var}=(0\mathbin{\vee} \#)^{*}\#0\cdot x\{0^{*}\} \cdot(\alpha_{1}\mathbin{\vee} \alpha_{2} \mathbin{\vee} {\cdots} \mathbin{\vee} \alpha_{n}) $$

for some $n\in \mathbb {N}_{>0}$ that depends on $\mathcal {X}$, where all α _i are xregex paths that do not contain variable bindings, and no other variable references than &x.

We note that the single variable binding x{0^∗} and all variable references &x do not occur under a Kleene star, and conclude that $\alpha _{\mathcal {X}}$ is a vstar-free xregex.

By Theorem 3.21, we can effectively convert every $\alpha _{\mathcal {X}}$ into a Boolean spanner representation $\rho _{\mathcal {X}}\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}$ with $\mathcal {L}(\rho _{\mathcal {X}})=\mathcal {L}(\alpha _{\mathcal {X}})$.

Then $[{\kern -2.3pt}[ \rho _{\mathcal {X}} ]{\kern -2.3pt}]={\Upsilon }_{\emptyset }$ holds if and only if $\mathcal {L}(\alpha _{\mathcal {X}})=\Sigma ^{*}$. As this question is not semi-decidable, CSp−Universality is also not semi-decidable. As CSp−Universality is a special case of CSp−Equivalence, the latter problem is also not semi-decidable.

Furthermore, $[{\kern -2.3pt}[ \rho _{\mathcal {X}} ]{\kern -2.3pt}]$ is a regular spanner if and only if $\mathcal {L}(\alpha _{\mathcal {X}})$ is a regular language (as shown by Fagin et al. [7], when viewed as language definition mechanisms, regular spanners define exactly the class of regular languages). This question is neither semi-decidable, nor co-semi-decidable; hence, this applies to CSp−Regularity as well. □

As the proof of Theorem 4.6 relies only on Boolean spanners, the decidability status of CSp−Regularity does not change if the problem asks for hierarchical regularity (i. e., membership in [[RGX]]) instead of regularity, as the two classes coincide for Boolean spanners. Likewise, CSp−Universality remains not semi-decidable if one replaces Υ_SVars(ρ) with ${\Upsilon }^H_{\textsf{SVars}(\rho )}$.

In the construction from this proof, variables are only bound to a language a ⁺. Hence, the same undecidability results hold for spanners that use selections by equal length relation, instead of the string equality relation. While the proof builds on xregexes $\alpha _{\mathcal {X}}$ that use only a single variable x, the resulting core spanners use an unbounded number of variables, as every occurrence of a variable reference &x in an xregex path is converted to a spanner variable x _i. But undecidability remains even if we bound the number of variables in the spanners, as the $\alpha _{\mathcal {X}}$ can be modified to use only a bounded number of variable references (see Section 4.1 in [12]). Theorem 4.6 also implies that CSp−Containment is not semi-decidable. This holds even for a more restricted class of spanners:

Theorem 4.7

The problem CSp−Containment is not semi-decidable, even if it is restricted to $\textsf{RGX}^{\{\pi ,\zeta ^=\}}$.

Proof

This proof uses the undecidability of the inclusion problem for pattern languages, which is defined as follows: Given two patterns α and β, decide whether $\mathcal {L}(\alpha )\subseteq \mathcal {L}(\beta )$.

For unbounded sizes of Σ, this undecidability was proven by Jiang et al. [25], and Freydenberger and Reidenbach [15] adapted this proof to all (non-unary) finite terminal alphabets.

Given two patterns α, β, we can use Theorem 3.3 to construct spanner representations $\rho _{\alpha },\rho _{\beta }\in \textsf{RGX}^{\{\zeta ^=\}}$ with $\mathcal {L}(\rho _{X})=\mathcal {L}(X)$ for X ∈ {α, β}, and turn these into representations of Boolean spanners $\hat {\rho }_{X}:=\pi _{\emptyset }\rho _{X}$. Then $[{\kern -2.3pt}[ \hat {\rho }_{\alpha } ]{\kern -2.3pt}](w)\subseteq [{\kern -2.3pt}[ \hat {\rho }_{\beta } ]{\kern -2.3pt}](w)$ holds for all w ∈ Σ ^∗ if and only if $\mathcal {L}(\alpha )\subseteq \mathcal {L}(\beta )$.

This shows that CSp−Containment is not decidable. As it is obviously co-semi-decidable, this also shows that CSp−Containment is not semi-decidable. □

As shown by Bremer and Freydenberger [4], the inclusion problem for pattern languages remains undecidable if the number of variables in the patterns is bounded. In fact, that proof constructs patterns where even the number of variable occurrences is bounded. Therefore, CSp−Containment is not semi-decidable even if restricted to representations from $\textsf{RGX}^{\{\pi ,\zeta ^=\}}$ with a bounded number of variables. It is a hard open question whether the equivalence problem for pattern languages is decidable (cf. Ohlebusch and Ukkonen [31], Freydenberger and Reidenbach [15]). Undecidability of this problem would imply undecidability of CSp−Equivalence, even if restricted to representations from $\textsf{RGX}^{\{\pi ,\zeta ^=\}}$.

We conclude this part of the section with a table that summarizes our results on decision problems:

Problem	Status	Reference
CSp−Eval	NP-complete	Theorem 4.1, Proposition 4.2
CSp−Eval(ρ)	in NLOGSPACE	Theorem 4.3
CSp−Sat	PSPACE-complete	Theorem 4.4
CSp−Hierarchicality	PSPACE-complete	Theorem 4.5
CSp−Universality	co-semi-decidable, not semi-decidable	Theorem 4.6
CSp−Equivalence	co-semi-decidable, not semi-decidable	Theorem 4.6
CSp−Containment	co-semi-decidable, not semi-decidable	Theorem 4.7
CSp−Regularity	neither semi-, nor co-semi-decidable	Theorem 4.6

Details under which restrictions the lower bounds persist can be found in the respective results.

4.2.1 Minimization and Relative Succinctness

In order to address the minimization of spanner representations, we first formalize the notion of the size or complexity of a spanner representation. Even for proper regular expressions, there are various different definitions of size, see e. g. Holzer and Kutrib [22], and there might be convincing reasons to add additional weight to the number of variables or other parameters. As we shall see, these distinctions do not affect the negative results that we prove later. Hence, instead of defining a single fixed notion of size, we use the following general definition of complexity measures from Kutrib [29]:

Definition 4.8

Let SR be a class of spanner representations. A complexity measure for SR is a recursive function $c\colon \textsf{SR}\to \mathbb {N}$ such that for each Σ, the set of all ρ ∈ SR that represent spanners over Σ can be effectively enumerated in order of increasing c(ρ), and does not contain infinitely many ρ ∈ SR with the same value c(ρ).

By recursive, we mean a function that is total and computable. Definition 4.8 is general enough to include all notions of complexity that take into account that descriptions are commonly encoded with a finite number of distinct symbols, and that it should be decidable if a word over these symbols is a valid encoding from SR. Regardless of the chosen complexity measure, computable minimization of core spanners is impossible:

Theorem 4.9

Let c be a complexity measure for $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ . There is no algorithm that, given a $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$, computes an equivalent $\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ that is c-minimal.

Proof

Define U _min to be the set of c-minimal core spanner representations of Υ_∅. By the definition of a complexity measure, U _min is finite. Hence, given a core spanner representation ρ, we can decide whether ρ ∈ U _min.

Now assume there is an algorithm MIN _c that minimizes core spanner representations with respect to c. Given a core spanner representation ρ, we can decide whether [[ρ]] = [[Υ_∅]], by checking whether MIN _c(ρ) ∈ U _min. But as shown in Theorem 4.6, this problem is undecidable. Hence, MIN _c cannot exist. □

In addition to regex formulas, Fagin et al. [7] also define spanner representations that are based on so-called vset- and vstk-automata (denoted by VA _set and VA _stk). They show [[VA _stk]] = [[RGX]] and [[VA _set]] = [[RGX ^{{π, ∪, ⋈}}]], and conclude that $[{\kern -2.3pt}[ \textsf{VA}_{\textsf{set}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \textsf{VA}_{\textsf{stk}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \textsf{RGX}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]$. Without going further into details, we note that their equivalence proofs use computable conversions between the models. Hence, Theorem 4.9 also applies to those spanner representations from [7] that can express core spanners, like $\textsf{VA}_{\textsf{stk}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}}$ and $\textsf{VA}_{\textsf{set}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}}$, and it implies that an algorithm that converts from one of these classes of representations to another cannot guarantee that its result is minimal.

Using a technique by Hartmanis [21], we can use the fact that CSp−Regularity is not co-semi-decidable to compare the relative succinctness of regular and core spanner representations:

Theorem 4.10

Let c ₁ and c ₂ be complexity measures for the classes RGX ^{{π, ∪, ⋈}} and $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$, respectively. For every recursive function $f\colon \mathbb {N}\to \mathbb {N}$, there exists a $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ such that [[ρ]] ∈ [[RGX ^{{π, ∪, ⋈}}]], but $c_{1}(\hat {\rho })>f(c_{2}(\rho ))$ holds for every $\hat {\rho }\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}$ with $[{\kern -2.3pt}[{\hat {\rho }}]{\kern -2.3pt}]=[{\kern -2.3pt}[\rho ]{\kern -2.3pt}]$.

Proof

For the sake of a contradiction, assume that there exist complexity measures c ₁ for RGX ^{{π, ∪, ⋈}} and c ₂ for $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$, as well as a recursive function f such that, for every core spanner representation $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ with [[ρ]] ∈ [[RGX ^{{π, ∪, ⋈}}]], there exists a regular spanner representation $\hat {\rho }\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}$ with $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]$ and $c_{1}(\hat {\rho })\leq f(c_{2}(\rho ))$. Our goal is to show that this implies that the set

$$\textsf{NR} := \{\rho\in\textsf{RGX}^{\{\pi,\zeta^=,\cup,\bowtie\}} \mid \text{there is no $\rho_{R}\in \textsf{RGX}^{\{\pi,\cup,\bowtie\}}$ with $[{\kern-2.3pt}[ \rho ]{\kern-2.3pt}]=[{\kern-2.3pt}[ \rho_{R} ]{\kern-2.3pt}]$}\} $$

is semi-decidable. As CSp−Regularity is not co-semi-decidable (Theorem 4.6), this will yield the desired contradiction.

We define a semi-decision procedure for NR as follows: Given a core spanner $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$, compute a complexity bound n := f(c ₂(ρ)). We define

$$F_{n}:=\{\rho_{R}\in \textsf{RGX}^{\{\pi,\cup,\bowtie\}}\mid c_{1}(\rho_{R})\leq n\}. $$

By Definition 4.8, the set F _n is finite, and we can effectively enumerate its elements ρ ₁, …, ρ _k for k := |F _n|.

Also by definition, we know that if there exists a ρ _R ∈ RGX ^{{π, ∪, ⋈}} with [[ρ _R]] = [[ρ]], there exists a $\hat {\rho }_{R}\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}$ with $[{\kern -2.3pt}[ \hat {\rho }_{R} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]$ and $\hat {\rho }_{R}\in F_{n}$. In other words: If [[ρ]] is expressible with regular spanners, it is expressible with a regular spanner representation $\hat {\rho }$ that satisfies the complexity bound n.

For all ρ _i ∈ F _n, we now semi-decide [[ρ]] ≠ [[ρ _i]]. In order to do this, we enumerate all w ∈ Σ ^∗. In each step, if [[ρ]](w) ≠ [[ρ _i]](w) holds, we mark ρ _i as not equivalent to ρ.

If all representations in F _n are marked, we know that no regular spanner representation ρ _R with [[ρ _R]] = [[ρ]] exists, and output True. As F _n is finite, this point is reached in a finite number of steps if there is no such representation. On the other hand, if such a representation exists, the procedure will never terminate. Hence, we have defined a semi-decision procedure for NR, which implies that CSp−Regularity is co-semi-decidable, a contradiction to Theorem 4.6. □
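The marking loop of this semi-decision procedure can be sketched in Python. This is only an illustration under simplifying assumptions: Boolean spanners are modelled as plain predicates on strings, the finite set F _n is passed in as a list, and the names `words`, `all_candidates_refuted`, `is_square` and the parameter `max_len` are hypothetical. The cut-off `max_len` is not part of the proof; it exists only so that the sketch can be run to completion.

```python
from itertools import count, product


def words(alphabet, max_len=None):
    """Yield all words over `alphabet`, shortest first (all of them if max_len is None)."""
    lengths = count(0) if max_len is None else range(max_len + 1)
    for n in lengths:
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def all_candidates_refuted(target, candidates, alphabet, max_len=None):
    """Return True once every candidate disagrees with `target` on some word.

    With max_len=None the loop never ends if some candidate is equivalent to
    `target`, which mirrors the behaviour of a semi-decision procedure.
    """
    unmarked = set(range(len(candidates)))
    if not unmarked:
        return True
    for w in words(alphabet, max_len):
        # Keep only the candidates that still agree with the target on w.
        unmarked = {i for i in unmarked if candidates[i](w) == target(w)}
        if not unmarked:
            return True
    return False  # only reachable when max_len cuts the enumeration short


def is_square(w):
    """Toy target: accepts exactly the words of the form uu."""
    half = len(w) // 2
    return len(w) % 2 == 0 and w[:half] == w[half:]


if __name__ == "__main__":
    candidates = [lambda w: True, lambda w: len(w) % 2 == 0]
    print(all_candidates_refuted(is_square, candidates, "ab"))
```

In the toy run, the first candidate is refuted by the word a and the second by the word ab, so the call returns after finitely many steps.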

Hence, the blowup from $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ to RGX ^{{π, ∪, ⋈}} is not bounded by any recursive function. As above, we can replace each of these classes with a class with the same expressive power; for example, we can replace RGX ^{{π, ∪, ⋈}} with VA _stk ^{{π, ∪, ⋈}}, VA _set, or VA _set ^{{π, ∪, ⋈}} (or, as the proof uses Boolean spanners, RGX or VA _stk, or any class between those).

We also consider the relative succinctness of representations of core spanners and representations of their complements. For every spanner P, we define its complement C(P) := Υ_Vars(P) ∖ P, and its hierarchical complement $\textsf{C}^H(P):= {\Upsilon }^H_{\textsf{Vars}(P)}\setminus P$.

Theorem 4.11

Let c be a complexity measure for $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$. For every recursive function $f\colon \mathbb {N}\to \mathbb {N}$, there exists a $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ such that

1. $\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\in [{\kern -2.3pt}[ \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}} ]{\kern -2.3pt}]$, but
2. $c(\hat {\rho })>f(c(\rho ))$ holds for every $\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ with $[{\kern -2.3pt}[\hat {\rho }]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[\rho ]{\kern -2.3pt}])$.

This also holds if we consider C ^H instead of C.

Proof

It suffices to prove the claim for Boolean core spanner representations (hence, we can focus on the case of C, and do not need to consider C ^H separately). For convenience, we define the set of all Boolean core spanner representations

$$\textsf{BCSR} :=\{\rho\in\textsf{RGX}^{\{\pi,\zeta^=,\cup,\bowtie\}}\mid {\textsf{SVars}\left( \rho\right)}=\emptyset\}. $$

As preparation for the actual proof, we consider the following sets of Boolean core spanner representations:

$$\begin{array}{@{}rcl@{}} \textsf{FIN}&:=&\{\rho\in\textsf{BCSR}\mid \mathcal{L}(\rho) \text{ is finite}\},\\ \textsf{COF}&:=& \{\rho\in\textsf{BCSR}\mid \mathcal{L}(\rho) \text{ is co-finite}\}. \end{array} $$

This proof heavily relies on various sets from the first two levels of the arithmetic hierarchy (cf. Kozen [28]). Without going into further details, note that ${\Sigma ^{0}_{1}}$ is the family of all sets that are semi-decidable (recursively enumerable), ${\Pi ^{0}_{1}}$ is the family of all sets that are co-semi-decidable (co-recursively enumerable), and ${\Delta ^{0}_{1}}={\Sigma ^{0}_{1}}\cap {\Pi ^{0}_{1}}$ is the family of all sets that are decidable.

Regarding the next level, ${\Sigma ^{0}_{2}}$ is the family of all sets that are semi-decidable when using oracles for sets in ${\Sigma ^{0}_{1}}$ (or in ${\Pi ^{0}_{1}}$), ${\Pi ^{0}_{2}}$ is the family of all sets that are co-semi-decidable when using such oracles. Finally, ${\Delta ^{0}_{2}}={\Sigma ^{0}_{2}}\cap {\Pi ^{0}_{2}}$ is the family of all sets that are decidable when using oracles for sets in ${\Sigma ^{0}_{1}}$ or in ${\Pi ^{0}_{1}}$.

A central part of our reasoning in this proof is the following observation:

Claim 1

${\textsf{COF}\not \in \Delta ^{0}_{2}}$.

Proof

As shown in Freydenberger [12], the xregexes that we used in the proof of Theorem 4.6 also prove that co-finiteness for vstar-free xregexes is ${\Sigma ^{0}_{2}}$-complete.

Hence, the proof of Theorem 4.6 also implies that COF is ${\Sigma ^{0}_{2}}$-hard. This immediately implies ${\textsf{COF}\notin \Delta ^{0}_{2}}$; as otherwise, ${\Sigma ^{0}_{2}}={\Delta ^{0}_{2}}$ would hold, which contradicts the fact that the arithmetical hierarchy is a proper hierarchy. $\square $ (Claim 1)
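The last step of Claim 1 can be spelled out as follows. This is a sketch that assumes hardness is meant with respect to many-one reductions, written $\leq_m$ (a symbol that is introduced here only for illustration), and that uses the closure of ${\Delta ^{0}_{2}}$ under such reductions. If $\textsf{COF}\in {\Delta ^{0}_{2}}$ held, then for every set A,

$$A\in{\Sigma^{0}_{2}} \;\Longrightarrow\; A\leq_m \textsf{COF} \;\Longrightarrow\; A\in{\Delta^{0}_{2}}, \qquad\text{and hence}\qquad {\Sigma^{0}_{2}}\subseteq{\Delta^{0}_{2}}={\Sigma^{0}_{2}}\cap{\Pi^{0}_{2}}\subseteq{\Sigma^{0}_{2}}. $$

This would give ${\Sigma ^{0}_{2}}={\Delta ^{0}_{2}}$, which is the equality that the properness of the hierarchy rules out.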

Our goal is to use Claim 1 to obtain the contradiction on which this proof rests. More precisely, we shall prove that any recursive bound on the size of the core spanner for a complement can be used to prove ${\textsf{COF}\in \Delta ^{0}_{2}}$. One of the central parts of our reasoning shall be the following result.

Claim 2

${\textsf{FIN}\in \Sigma ^{0}_{1}}$.

Proof

We give the following semi-decision procedure for FIN. Let ρ ∈ BCSR. Enumerate all finite sets S ⊂ Σ ^∗. For each set, we check the following two conditions:

1. $S\subseteq \mathcal {L}(\rho )$
2. $\mathcal {L}(\rho )\cap (\Sigma ^{*}\setminus S)=\emptyset $

Note that both conditions are decidable: As S is finite, the first condition can be checked by deciding if $w\in \mathcal {L}(\rho )$ for each w ∈ S.

For the second condition, we first construct a regular expression α with $\mathcal {L}(\alpha )= (\Sigma ^{*}\setminus S)$. Then, we define the Boolean core spanner representation ρ _S := α ∩ ρ. As $\mathcal {L}(\rho _{S})=\mathcal {L}(\alpha )\cap \mathcal {L}(\rho )=(\Sigma ^{*}\setminus S)\cap \mathcal {L}(\rho )$, we can decide the second condition by checking if $\mathcal {L}(\rho _{S})=\emptyset $ (which is decidable, according to Theorem 4.4).

If S satisfies both conditions, $S=\mathcal {L}(\rho )$ holds. Hence, $\mathcal {L}(\rho )$ is finite, and the semi-decision procedure returns True. Furthermore, for every ρ ∈ FIN, the procedure will (after a finite number of enumerated finite sets) check the set $S=\mathcal {L}(\rho )$, and then return True. Thus, FIN is semi-decidable, which is equivalent to ${\textsf{FIN}\in \Sigma ^{0}_{1}}$. $\square $ (Claim 2)
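The enumeration in Claim 2 can be sketched as follows. The sketch assumes that the two decidable checks are supplied as callables: `member(w)` stands for the membership test for $\mathcal {L}(\rho )$, and `empty_outside(S)` stands for the emptiness test of $\mathcal {L}(\rho _{S})$ from Theorem 4.4. Both names, as well as `words_up_to`, `semi_decide_finite` and the cut-off `max_k`, are hypothetical; in the toy run, the second oracle is simulated with a finite Python set.

```python
from itertools import combinations, count, product


def words_up_to(alphabet, k):
    """All words over `alphabet` of length at most k."""
    return ["".join(p) for n in range(k + 1) for p in product(alphabet, repeat=n)]


def semi_decide_finite(member, empty_outside, alphabet, max_k=None):
    """Search for a finite set S with S = L; return it once it is found.

    Every finite S is a subset of the words of length <= k for some k, so
    iterating over k reaches every finite set. With max_k=None the search
    never ends if L is infinite (max_k is a testing aid only).
    """
    bounds = count(0) if max_k is None else range(max_k + 1)
    for k in bounds:
        universe = words_up_to(alphabet, k)
        for size in range(len(universe) + 1):
            for subset in combinations(universe, size):
                S = frozenset(subset)
                # Condition 1: S is contained in L (finitely many membership tests).
                # Condition 2: L contains no word outside S (emptiness oracle).
                if all(member(w) for w in S) and empty_outside(S):
                    return S
    return None  # only reachable when max_k cuts the search short


if __name__ == "__main__":
    language = {"", "ab"}  # toy finite language over {a, b}
    found = semi_decide_finite(
        member=lambda w: w in language,
        empty_outside=lambda S: language <= S,  # stand-in for the emptiness test
        alphabet="ab",
    )
    print(sorted(found))
```

The brute-force enumeration revisits sets for growing k and is exponential in the number of words considered; it is meant to mirror the structure of the argument, not to be efficient.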

The next observation is not very deep; but in order to streamline the flow of our later reasoning, we state it as a separate claim.

Claim 3

For every ρ ∈ BCSR, we have that ρ ∈ COF holds if and only if there is a $\hat {\rho }\in \textsf{FIN}$ with $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])$.

Proof

Let ρ ∈ BCSR. We begin with the if-direction. Assume there exists a $\hat {\rho }\in \textsf{FIN}$ with $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])$. As $\hat {\rho }\in \textsf{FIN}$, the language $\mathcal {L}(\hat {\rho })$ is finite, which implies that $\mathcal {L}(\rho )=\Sigma ^{*}\setminus \mathcal {L}(\hat {\rho })$ is co-finite. Hence, ρ ∈ COF.

For the only-if direction, let ρ ∈ COF; i. e., $\mathcal {L}(\rho )$ is co-finite. Hence, $\Sigma ^{*}\setminus \mathcal {L}(\rho )$ is finite, and regular. Thus, there exists a proper regular expression $\hat {\rho }$ with $\mathcal {L}(\hat {\rho })=\Sigma ^{*}\setminus \mathcal {L}(\rho )$. As every proper regular expression is also a functional regex formula with no variables (and, hence, Boolean), $\hat {\rho }\in \textsf{BCSR}$ follows. This gives $\hat {\rho }\in \textsf{FIN}$, while $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])$ holds by our choice of $\hat {\rho }$. $\square $ (Claim 3)

We now proceed to the main part of the proof, which uses these claims. Let c be a complexity measure for the class $\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$. Assume that there exists a recursive function $f\colon \mathbb {N}\to \mathbb {N}$ such that for all $\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ for which C([[ρ]]) is a core spanner, there exists a $\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}$ with $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])$ and $c(\hat {\rho })\leq f(c(\rho ))$.

Our goal is to show that this assumption implies that COF is in ${\Delta ^{0}_{2}}$. We prove this by defining a decision procedure with oracles for ${\Sigma ^{0}_{1}}$ and ${\Pi ^{0}_{1}}$ on the input ρ ∈ BCSR as follows. First, compute n := f(c(ρ)), and let

$$R_{n} := \{\hat{\rho}\in\textsf{BCSR} \mid c(\hat{\rho})\leq n\}. $$

From Claim 3, we know that ρ ∈ COF if and only if there is a $\hat {\rho }\in \textsf{FIN}$ with $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])$. Due to our assumption on f, this holds if and only if such a $\hat {\rho }$ exists in R _n.

We now check for each $\hat {\rho }\in R_{n}$ whether it satisfies these two criteria:

1. $\hat {\rho }\in \textsf{FIN}$
2. $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])$

Due to Claim 2, we know that FIN is in ${\Sigma ^{0}_{1}}$. Hence, the first criterion can be decided with a ${\Sigma ^{0}_{1}}$-oracle.

Regarding the second criterion, note that $[{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\neq \textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])$ is semi-decidable (as it suffices to find a w ∈ Σ ^∗ that disproves the equality). Hence, this criterion is co-semi-decidable, which means that it can be decided with a ${\Pi ^{0}_{1}}$-oracle.

If there exists a $\hat {\rho }\in R_{n}$ that satisfies both criteria, the procedure returns True. In this case, ρ ∈ COF holds by Claim 3; hence, this is correct.

If no such $\hat {\rho }$ can be found among the (finitely many) elements of R _n, the procedure returns False. As mentioned above, this is correct due to our assumptions on f.

As COF can be decided by using oracles for ${\Sigma ^{0}_{1}}$ and ${\Pi ^{0}_{1}}$, we know that $\textsf{COF}\in {\Delta ^{0}_{2}}$ must hold. This contradicts Claim 1. As our only assumption was the existence of the recursive bound f, no such bound can exist. □
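For orientation, the correctness of the decision procedure rests on the following chain of equivalences, stated here as a summary sketch for n = f(c(ρ)). The first equivalence is Claim 3, and the second one holds only under the assumed recursive bound f:

$$\rho\in\textsf{COF} \;\Longleftrightarrow\; \exists\,\hat{\rho}\in\textsf{FIN}\colon [{\kern-2.3pt}[ \hat{\rho} ]{\kern-2.3pt}]=\textsf{C}([{\kern-2.3pt}[ \rho ]{\kern-2.3pt}]) \;\Longleftrightarrow\; \exists\,\hat{\rho}\in R_{n}\colon \hat{\rho}\in\textsf{FIN} \text{ and } [{\kern-2.3pt}[ \hat{\rho} ]{\kern-2.3pt}]=\textsf{C}([{\kern-2.3pt}[ \rho ]{\kern-2.3pt}]). $$

As R _n is finite, the right-hand side requires only finitely many oracle queries.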

In other words, there are core spanners where the (hierarchical) complement is also a core spanner, but the blowup between their representations is not bounded by any recursive function. Again, this holds for the other classes of representations as well.

This result has consequences for an open question of Fagin et al. One of the central tools in [7] is the core-simplification lemma, which states that every core spanner is definable by an expression of the form π _V S A, where A is a vset-automaton, V ⊆ SVars(A), and S is a sequence of selections $\zeta ^=_{x,y}$ for x, y ∈ SVars(A).
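To make the shape π _V S A concrete, the following minimal sketch evaluates such an expression on a single input string. It rests on several assumptions that are not taken from [7]: spans are modelled as 0-based half-open index pairs (Python slice bounds), the vset-automaton A is replaced by a hand-written function `split_relation` that returns all factorisations w = x · y, and all function names are hypothetical.

```python
def split_relation(w):
    """Stand-in for [[A]](w): all tuples (x, y) of spans with w = x . y."""
    n = len(w)
    return [{"x": (0, i), "y": (i, n)} for i in range(n + 1)]


def select_eq(w, relation, x, y):
    """String equality selection: keep tuples whose x- and y-spans spell the same substring."""
    def text(span):
        i, j = span
        return w[i:j]
    return [t for t in relation if text(t[x]) == text(t[y])]


def project(relation, variables):
    """Projection onto `variables`, with duplicate tuples removed."""
    result = []
    for t in relation:
        restricted = {v: t[v] for v in variables}
        if restricted not in result:
            result.append(restricted)
    return result


def core_simplified(w):
    """Evaluate pi_{x} zeta^=_{x,y} A on w, with A simulated by split_relation."""
    return project(select_eq(w, split_relation(w), "x", "y"), ["x"])


if __name__ == "__main__":
    print(core_simplified("abab"))  # the only split with equal halves is ab . ab
    print(core_simplified("aba"))   # no split of aba has two equal halves
```

Note that the selection compares the substrings that the spans refer to, not the spans themselves, so two different spans can satisfy it.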

In addition to core spanners, Fagin et al. also discuss adding a set difference operator ∖, and ask “whether we can find a simple form, in the spirit of the core-simplification lemma, when adding difference to the representation of core spanners”. It is a direct consequence of Theorem 4.11 that such a simple representation, if it exists, cannot be obtained effectively, as reducing the number of difference operators can lead to a non-recursive blowup. While this observation does not prove that such a simple form does not exist, it suggests that any proof of its existence should be expected to be non-constructive.

5 Conclusions and Further Work

In Section 3, we have seen that core spanners can express languages that are defined by patterns or by vstar-free xregexes. We used this in Section 4 to derive various lower bounds on decision problems, even for subclasses of core spanner representations. Note that in most of the cases, these lower bounds do not require the join operator, and mostly rely on the string equality selection. This can be interpreted as a sign that string equality (or repetition) is an expensive operator, in particular as similar results have been observed for related models (e. g., [2, 12, 16]). On the other hand, Proposition 4.2 demonstrates that even without string equality, join is also an expensive operator. The authors take this as a sign that the search for good restrictions on core spanners will probably have to combine restrictions on string equality and on join.

There is also reason to hope that the connections to patterns and word equations can be beneficial for spanners: There is recent work on restricted classes of pattern languages with an efficient membership problem (e. g., [10, 33]), which could lead to subclasses of spanners that can be evaluated more efficiently. Furthermore, as Theorems 3.12 and 3.13 show, core spanners and word equations with regular constraints are closely related. Recent work on word equations has also considered tasks like enumerating all solutions of an equation. The employed compression techniques (cf. [6]) might also be used to improve the evaluation of core spanners. In particular, the EC ^reg-formulas that are constructed in the proof of Theorem 3.12 have the special property that there is a variable x _w (for w), and for every solution σ and every variable x, we have that σ (x) is a subword of σ (x _w).

Freydenberger [13] builds on this observation and introduces a fragment of EC ^reg that has exactly the same expressive power as core spanners. The connection is even stronger: As shown in [13], there exist polynomial time conversions between this fragment and core spanner representations. It remains to be seen whether the connection between spanners and word equations can also be used to find interesting subclasses of core spanners that have friendlier upper bounds (in particular regarding evaluation).

Also note that the conversion from vstar-free regular expressions to core spanner representations that is used for Theorem 3.21 can lead to an exponential increase in size. As shown in [13], this blowup can be avoided by using a more involved construction.

Finally, while we only mentioned this explicitly in Section 4.2.1, note that most of the other results in this paper can also be directly converted to the appropriate spanner representations that use vset- and vstk-automata from [7].

Notes

Following the terminology of [3]; literature also uses the term rational constraints.

References

1. Angluin, D.: Finding patterns common to a set of strings. J. Comput. Syst. Sci. 21, 46–62 (1980)
2. Barceló, P., Libkin, L., Lin, A.W., Wood, P.T.: Expressive languages for path queries over graph-structured data. ACM Trans. Database Syst. 37(4), 31 (2012)
3. Barceló, P., Muñoz, P.: Graph Logics with Rational relations: The Role of Word Combinatorics. In: Proc. CSL-LICS 2014 (2014)
4. Bremer, J., Freydenberger, D.D.: Inclusion problems for patterns with a bounded number of variables. Inform. Comput. 220–221, 15–43 (2012)
5. Câmpeanu, C., Salomaa, K., Yu, S.: A formal study of practical regular expressions. Int. J. Found Comput. Sci. 14, 1007–1018 (2003)
6. Diekert, V.: Makanin’s Algorithm. In: Lothaire, M. (ed.) Algebraic Combinatorics on Words, chapter 12, pages 387–442. Cambridge University Press (2002)
7. Fagin, R., Kimelfeld, B., Reiss, F., Vansummeren, S.: Document spanners: A formal approach to information extraction. J. ACM 62(2), 12 (2015)
8. Fagin, R., Kimelfeld, B., Reiss, F., Vansummeren, S.: Declarative cleaning of inconsistencies in information extraction. ACM Trans. Database Syst. 41(1), 6 (2016)
9. Fernau, H., Manea, F., Mercas, R., Schmid, M.L.: Pattern Matching with variables: Fast Algorithms and New Hardness Results. In: Proc. STACS 2015 (2015)
10. Fernau, H., Schmid, M.L.: Pattern matching with variables: A multivariate complexity analysis. Inf. Comput. 242, 287–305 (2015)
11. Fernau, H., Schmid, M.L., Villanger, Y.: On the parameterised complexity of string morphism problems. Theory Comput. Sys. (2015)
12. Freydenberger, D.D.: Extended regular expressions: Succinctness and decidability. Theory Comput. Sys. 53(2), 159–193 (2013)
13. Freydenberger, D.D.: A Logic for Document Spanners. In: Proc ICDT. Accepted (2017)
14. Freydenberger, D.D., Holldack, M.: Document spanners: From Expressive Power to Decision Problems. In: Proc. ICDT 2016 (2016)
15. Freydenberger, D.D., Reidenbach, D.: Bad news on decision problems for patterns. Inform. Comput. 208(1), 83–96 (2010)
16. Freydenberger, D.D., Schweikardt, N.: Expressiveness and static analysis of extended conjunctive regular path queries. J. Comput. Syst. Sci. 79(6), 892–909 (2013)
17. Friedl, J.E.F.: Mastering Regular Expressions. O’Reilly Media. 3rd edition (2006)
18. Garey, M.R., Johnson, D.S.: Computers and intractability. W. H. Freeman and Company (1979)
19. Ginsburg, S., Spanier, E.: Semigroups, Presburger formulas, and languages. Pac. J. Math. 16(2), 285–296 (1966)
20. Grohe, M., Flum, J.: Parameterized complexity theory. Texts in Theoretical Computer Science. Springer (2006)
21. Hartmanis, J.: On Gödel speed-up and succinctness of language representations. Theor. Comput. Sci. 26(3), 335–342 (1983)
22. Holzer, M., Kutrib, M.: Descriptional complexity–an introductory survey. Sci. Appl. Language Methods 2, 1–58 (2010)
23. Ibarra, O.H., Pong, T.-C., Sohn, S.M.: A note on parsing pattern languages. Pattern Recogn. Lett. 16(2), 179–182 (1995)
24. Jiang, T., Kinber, E., Salomaa, A., Salomaa, K., Yu, S.: Pattern languages with and without erasing. Int. J Comput. Math. 50, 147–163 (1994)
25. Jiang, T., Salomaa, A., Salomaa, K., Yu, S.: Decision problems for patterns. J. Comput. Syst Sci. 50, 53–63 (1995)
26. Karhumäki, J., Mignosi, F., Plandowski, W.: The expressibility of languages and relations by word equations. J. ACM 47(3), 483–505 (2000)
27. Kozen, D.: Lower Bounds for Natural Proof Systems. In: Proc. FOCS 1977 (1977)
28. Kozen, D.: Theory of computation. Springer-Verlag (2006)
29. Kutrib, M.: The phenomenon of non-recursive trade-offs. Int. J. Found. Comput. Sci. 16(5), 957–973 (2005)
30. Lothaire, M.: Combinatorics on Words. Cambridge University Press (1997)
31. Ohlebusch, E., Ukkonen, E.: On the equivalence problem for E-pattern languages. Theor. Comput. Sci. 186, 231–248 (1997)
32. Parikh, R.J.: On context-free languages. J. ACM 13(4), 570–581 (1966)
33. Reidenbach, D., Schmid, M.L.: Patterns with bounded treewidth. Inform. Comput. 239, 87–99 (2014)
34. Stephan, F., Yoshinaka, R., Zeugmann, T.: On the Parameterised Complexity of Learning Patterns. In: Proc. ISCIS 2011 (2011)

Acknowledgements

We thank Florin Manea for his suggestion to use word equations with regular constraints, and Thomas Zeume for reporting a list of typos. We also thank the anonymous reviewers of both this paper and the conference version for their feedback.

Author information

Authors and Affiliations

Loughborough University, Loughborough, UK
Dominik D. Freydenberger
Goethe University, Frankfurt am Main, Germany
Mario Holldack

Corresponding author

Correspondence to Dominik D. Freydenberger.

Additional information

This article is part of the Topical Collection on Special Issue on Database Theory

A preliminary version of this article appeared as [14]. Dominik D. Freydenberger was supported by Deutsche Forschungsgemeinschaft (DFG) under grant FR 3551/1-1.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Freydenberger, D.D., Holldack, M. Document Spanners: From Expressive Power to Decision Problems. Theory Comput Syst 62, 854–898 (2018). https://doi.org/10.1007/s00224-017-9770-0

Published: 22 May 2017
Issue Date: May 2018
DOI: https://doi.org/10.1007/s00224-017-9770-0
