# Document Spanners: From Expressive Power to Decision Problems

## Abstract

We examine *document spanners*, a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). A document spanner is a function that maps an input string to a relation over *spans* (intervals of positions of the string). We focus on document spanners that are defined by *regex formulas*, which are basically regular expressions that map matched subexpressions to corresponding spans, and on *core spanners*, which extend the former by standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to three models – namely, *patterns*, *word equations*, and a rich and natural subclass of *extended regular expressions* (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.

### Keywords

Information extraction · Document spanners · Regular expressions · Xregex · Patterns · Word equations · Decision problems · Descriptional complexity

## 1 Introduction

Information Extraction (IE) is the task of automatically extracting structured information from texts. This paper examines *document spanners* (also called *spanners*), a formalization of the IE query language AQL, which is used in IBM’s SystemT. Document spanners were introduced by Fagin et al. [7] in order to allow the theoretical examination of AQL, and were also used in [8].

A *span* is an interval on positions of a string *w*, and a *spanner* is a function that maps *w* to a relation over spans of *w*. A central topic of [7] and of the present paper are *core spanners* (according to Fagin et al., this name was chosen because core spanners capture the core of AQL).

The primitive building blocks of core spanners are *regex formulas*, which are regular expressions with variables. Each of these variables corresponds to a subexpression, and whenever a regex formula *α* matches a string *w*, each variable is mapped to the span in *w* that matches that subexpression. For example, consider the regex formula *α* := *x*{aaa} ⋅ a^{+} ⋅ *y*{a^{+}}, with terminal a, and variables *x* and *y*. When *α* matches a string *w*, it maps *x* to the span that contains the first three positions of *w*, and *y* to a span from some position after the third to the last position of *w*. Hence, each match of *α* on *w* determines a tuple of spans; and as there can be multiple matches of a regex formula to a string, this process creates a relation over spans of *w*. Core spanners are then defined by extending regex formulas with the relational operations projection, union, natural join, and string equality selection.
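To make the example concrete, the following Python sketch enumerates the span tuples that *α* = *x*{aaa} ⋅ a^{+} ⋅ *y*{a^{+}} produces on a given word. It hard-codes this single formula and is not a general evaluator for regex formulas; the function name `alpha_tuples` and the encoding of a span [*i*, *j*〉 as the pair `(i, j)` are assumptions made for illustration.

```python
def alpha_tuples(w):
    """Enumerate (x, y) span pairs for alpha = x{aaa} . a+ . y{a+} on w.

    Spans use 1-based positions; the pair (i, j) stands for [i, j>.
    """
    n = len(w)
    # alpha only matches words a^n with n >= 5: three a's for x,
    # at least one for the middle a+, and at least one for y.
    if set(w) - {"a"} or n < 5:
        return set()
    x = (1, 4)  # x always covers the first three positions
    # y may start at any position j >= 5 and always ends at n + 1
    return {(x, (j, n + 1)) for j in range(5, n + 1)}
```

On the word aaaaaa, this yields two tuples, which illustrates how a single regex formula can define a relation with several rows for one input.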

One of the two main topics of the present paper is the examination of decision problems for core spanners, in particular evaluation and static analysis. These results are mostly derived from the other main topic, the examination of the expressive power of core spanners in relation to three other models that use repetition operators, which act similarly to the spanners’ string equality selection.

We begin by comparing core spanners to *patterns*. A pattern is a word that consists of variables and terminals, and generates the language of all words that can be obtained by substitution of the variables with arbitrary terminal words. For example, the pattern *α* = *x**x*ab*y* (where *x* and *y* are variables, and a and b are terminals) generates the language of all words that have a prefix that consists of a square, followed by the word ab. Although pattern languages have a simple definition, various decision problems for them are surprisingly hard. For example, their membership problem is NP-complete (cf. Angluin [1], Jiang et al. [24]), and their inclusion problem is undecidable (cf. Bremer and Freydenberger [4]). As we show that core spanners can recognize pattern languages, this allows us to conclude that evaluation of Boolean core spanners is NP-hard, and that spanner containment is undecidable.

Next, we consider *word equations*, which are equations of the form *α* = *β*, where *α* and *β* are patterns. Word equations can be used to define languages and word relations. We show that word equations with regular constraints can express all relations that are expressible with core spanners. By using an improved version of Makanin’s algorithm (cf. Diekert [6]), this allows us to show that satisfiability and hierarchicality for core spanners can be decided in PSPACE. Moreover, using coding techniques from word equations, we show that two common relations from combinatorics on words can be selected with core spanners.

Finally, we examine the relation of core spanners to *xregexes* (also called *extended regular expressions*, *regexes*, or *regular expressions with backreferences* in the literature). These are regular expressions that can use a repetition operator that is available in most modern implementations of regular expressions (see, e. g., Friedl [17]) and that allows the definition of non-regular languages. For example, the xregex *x*{ *Σ*^{∗}} ⋅ &*x* ⋅ &*x* generates all cubic words over *Σ*, as *x*{ *Σ*^{∗}} generates some word *w* which is stored in the variable *x*, and each occurrence of &*x* repeats that *w*. As a consequence of this increase in expressive power, many decision problems are harder for xregexes than for their “classical” counterparts. In particular, various problems of static analysis are undecidable (Freydenberger [12]).

But as shown by Fagin et al. [7], core spanners cannot define all languages that are definable by xregexes. Intuitively, the reason for this is that xregexes can use their repetition operators inside a Kleene star, which allows them to repeat an arbitrary word an unbounded number of times – for example, the xregex *x*{ *Σ*^{∗}}⋅&*x*^{+} generates the language of all *w*^{n}, *n* ≥ 2. In contrast to this, core spanners have to express repetitions with variables and string equality selections. Inspired by this observation, we introduce *variable-star free (or vstar-free) xregexes* as those xregexes that neither define nor use variables inside a Kleene star. We show that every vstar-free xregex can be converted into an equivalent core spanner. Since all undecidability results by Freydenberger [12] also apply to vstar-free xregexes, these undecidability results carry over to core spanners. This also has various consequences for the minimization and the relative succinctness of classes of spanner representations. We also show that complementing a core spanner can lead to a size increase that is not bounded by any recursive function (for basically all natural notions of size). Although this does not solve an open problem by Fagin et al. [7] on the simplification of core spanners with difference operators, it shows that if simplification is possible, it has to be non-computable. As a further contribution, we also develop tools to prove inexpressibility for vstar-free regular expressions and for core spanners.

As we shall see, many of the observed lower bounds hold even for comparatively restricted classes of core spanners (in particular, most of the results hold for spanners that do not use join). Hence, the authors consider it reasonable to expect that these results can be easily adapted to other information extraction languages that combine regular expressions with capture variables and a string equality operator.

In addition to regex formulas, Fagin et al. [7] also consider two types of automata as basic building blocks of spanner representations. While the present paper does not discuss these in detail, most of the results on spanner representations that are based on regex formulas can be directly converted to the respective class of spanner representations that are based on automata.

### Related Work

For an overview of related models, we refer to Fagin et al. [7]. In addition to this, we highlight connections to models with similar properties. In [7], Fagin et al. showed that there is a language that can be defined by xregexes, but not by core spanners. Furthermore, they compared the expressive power of core spanners and a variant of conjunctive regular path queries (CRPQs), a graph querying language. Barceló et al. [2] introduced extended CRPQs (ECRPQs), which can compare paths in the graph with regular relations. While there is no direct connection between ECRPQs and core spanners, both models share the basic idea of combining regular languages with a comparison operator that can express string equality. As shown by Freydenberger and Schweikardt [16], ECRPQs have undecidability results that are comparable to those in the present paper, and to those for xregexes (cf. Freydenberger [12]). Furthermore, Barceló and Muñoz [3] have used word equations with regular constraints for variants of CRPQs.

Also note that Freydenberger [13] extends the results on the connection between word equations and core spanners from the present paper into a logic on words that has the same expressive power as core spanners.

### Structure of the Paper

In Section 2, we give definitions of xregexes and of core spanners. Section 3 compares the expressive power of core spanners to patterns, word equations, and vstar-free regular expressions. The results from this section are then used in Section 4 to examine the complexity of evaluation and static analysis of spanners. We also examine the consequences of these results for the relative succinctness of different spanner representations. Section 5 concludes the paper.

## 2 Preliminaries

Let \(\mathbb {N}\) and \(\mathbb {N}_{>0}\) be the sets of non-negative and positive integers, respectively. Let *Σ* be a fixed finite alphabet of *(terminal) symbols*. Except when stated otherwise, we assume | *Σ* | ≥ 2. We use *ε* to denote the *empty word*. For every word *w* ∈ *Σ*^{∗} and every *a* ∈ *Σ* , let |*w*| denote the length of *w*, and |*w*|_{a} the number of occurrences of *a* in *w*. A word *x* ∈ *Σ*^{∗} is a *subword* of a word *y* ∈ *Σ*^{∗} if there exist *u*, *v* ∈ *Σ*^{∗} with *y* = *u**x**v* . A word *x* ∈ *Σ*^{∗} is a *prefix* of a word *y* ∈ *Σ*^{∗} if there exists a *v* ∈ *Σ*^{∗} with *y* = *x**v* , and a *proper prefix* if it is a prefix and *x* ≠ *y*. For every \(n\in \mathbb {N}\), an *n-ary word relation (over Σ)* is a subset of (*Σ*^{∗})^{n}.

### 2.1 Xregexes (Extended Regular Expressions)

This section introduces the syntax and semantics of xregexes, which we shall also use for regex formulas in Section 2.2. We begin with the syntax, which follows the definition from [7].

**Definition 2.1**

We fix a set *X* of *variables* and define the set *M* of *meta symbols* as *M* := {*ε*, *∅*, (, ), {, }, ⋅, ∨, ^{∗}, &}. Let *Σ*, *X*, and *M* be pairwise disjoint. The set of *xregexes (extended regular expressions)* is defined as follows:

1. The symbols *∅* and *ε*, and every *a* ∈ *Σ* are xregexes.
2. If *α*_{1} and *α*_{2} are xregexes, then (*α*_{1} ⋅ *α*_{2}) *(concatenation)*, (*α*_{1} ∨ *α*_{2}) *(disjunction)*, and \((\alpha _{1}^{*})\) *(Kleene star)* are xregexes.
3. For every *x* ∈ *X* and every xregex *α* that contains neither *x*{⋯} nor &*x* as a subword, *x*{*α*} is an xregex *(variable binding)*.
4. For every *x* ∈ *X*, we have that &*x* is an xregex *(variable reference)*.

If a subword *β* of an xregex *α* is an xregex itself, we call *β* a *subexpression (of α)*. The set of all subexpressions of *α* is denoted by Sub(*α*), and the set of variables occurring in variable bindings in an xregex *α* is denoted by Vars(*α*). If an xregex *α* contains neither variable references nor variable bindings, we call *α* a *proper regular expression*.

In other words, we use the term “proper” to distinguish those expressions that are usually just called “regular expressions” from the more general extended regular expressions. We use the notation *α*^{+} as a shorthand for *α* ⋅ *α*^{∗}. Parentheses can be added freely. We may also omit parentheses and the concatenation operator, where we assume that ∗ and + take precedence over concatenation, and concatenation takes precedence over disjunction. Furthermore, we use *Σ* as a shorthand for the regular expression \(\bigvee _{a\in \Sigma } a\).

Before introducing the semantics of xregexes formally, we give an intuitive explanation. An expression of the form *α* = *x*{*β*} matches the same strings as *β*, but *α* additionally stores the matched string in the variable *x*. Using a variable reference &*x*, this string can then be repeated. For example, let *α* := (*x*{ *Σ*^{∗}} ⋅ &*x*). The subexpression *x*{ *Σ*^{∗}} matches any string *w* ∈ *Σ*^{∗} and stores this match in *x*. The following variable reference &*x* repeats the stored *w*. Thus, *α* defines the (non-regular) *copy-language* {*w**w*∣*w* ∈ *Σ*^{∗}}.

The following definition of the semantics of xregexes is based on the semantics by Freydenberger [12], which is an adaption of the semantics from Câmpeanu et al. [5] (the former uses variables, the latter backreferences). In comparison to [12], the case for Kleene star has been changed, in order to make the definition compatible with the parse trees for regex formulas from Fagin et al. [7].

**Definition 2.2**

Let *γ* be an xregex over *Σ* and *X*. A *γ-parse tree* is a finite, directed, and ordered tree *T*_{γ}. Its nodes are labeled with tuples of the form (*w*, *γ*′) ∈ (*Σ*^{∗} × Sub(*γ*)). The root of every *γ*-parse tree *T*_{γ} is labeled (*w*, *γ*) with *w* ∈ *Σ*^{∗}; and the following rules must hold for each node *v* of *T*_{γ}:

1) If *v* is labeled (*w*, *a*) with *a* ∈ (*Σ* ∪ {*ε*}), then *v* is a leaf, and *w* = *a*.
2) If *v* is labeled (*w*, (*β*_{1} ⋅ *β*_{2})), then *v* has exactly one left child *v*_{1} and exactly one right child *v*_{2} with respective labels (*w*_{1}, *β*_{1}) and (*w*_{2}, *β*_{2}), and *w* = *w*_{1}*w*_{2}.
3) If *v* is labeled (*w*, (*β*_{1} ∨ *β*_{2})), then *v* has a single child, labeled (*w*, *β*_{1}) or (*w*, *β*_{2}).
4) If *v* is labeled (*w*, *β*^{∗}), then one of the following cases holds: (a) *w* = *ε*, and *v* is a leaf, or (b) *w* = *w*_{1}*w*_{2}…*w*_{k} for words *w*_{1}, …, *w*_{k} ∈ *Σ*^{+} (with *k* ≥ 1), and *v* has *k* children *v*_{1}, …, *v*_{k} (ordered from left to right) that are labeled (*w*_{1}, *β*), …, (*w*_{k}, *β*).
5) If *v* is labeled (*w*, *x*{*β*}), then *v* has a single child, labeled (*w*, *β*).
6) If *v* is labeled (*w*, &*x*), let ≺ denote the post-order of the nodes of *T*_{γ} (that results from a left-to-right, depth-first traversal). Then one of the following cases applies: (a) If there is no node *v*′ with *v*′ ≺ *v* that is labeled (*w*′, *x*{*β*′}) ∈ *Σ*^{∗} × Sub(*γ*), then *v* is a leaf, and *w* = *ε*. (b) Otherwise, let *v*′ be the node with *v*′ ≺ *v* that is ≺-maximal among nodes labeled (*w*′, *x*{*β*′}). Then *v* is a leaf, and *w* = *w*′.

If the root of a *γ*-parse tree *T*_{γ} is labeled (*w*, *γ*), we call *T*_{γ} a *γ-parse tree for w*. If the context is clear, we omit *γ* and call *T*_{γ} a parse tree.

Note that no parse tree contains a node whose label has the form (*w*, *∅*), and references to unbound variables (i. e., variables that were not assigned a value with a variable binding operator) default to *ε*. For an example of a parse tree, see Fig. 1.

We use parse trees to define the semantics of xregexes:

**Definition 2.3**

An xregex *γ* recognizes the language \(\mathcal {L}(\gamma )\) of all *w* ∈ *Σ*^{∗} for which there exists a *γ*-parse tree *T*_{γ} with (*w*, *γ*) as root label.

*Example 2.4*

Consider the xregexes *α* := *x*{*Σ*^{+}}⋅(&*x*)^{+}, *β* := *x*{*Σ*^{+}}⋅&*x* ⋅ *x*{*Σ*^{+}}⋅&*x*, and *γ* := *x*{*a**a*^{+}}⋅(&*x*)^{+} for some *a* ∈ *Σ*.

Then \(\mathcal {L}(\alpha )=\{w^{n}\mid w\in \Sigma ^{+}, n\geq 2\}\), \(\mathcal {L}(\beta )=\{x_{1}x_{1}x_{2}x_{2}\mid x_{1},x_{2}\in \Sigma ^{+}\}\) , and \(\mathcal {L}(\gamma )=\{a^{n}\mid n\geq 2, \text {\textit {n} is not prime}\}.\)
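The languages above can be explored with the backreferences of Python's `re` module, which play the role of variable references. The following sketch assumes *Σ* = {a, b} and uses named groups `(?P<x>…)` for bindings and `(?P=x)` for references; the names `COPY`, `ALPHA`, `GAMMA`, and `in_language` are chosen for illustration. Python's semantics are not identical to Definition 2.2 in all corner cases (for instance, references to unbound variables), so this is an approximation for these specific expressions only.

```python
import re

# Alphabet assumed to be {a, b} for this sketch.
COPY = re.compile(r"(?P<x>[ab]*)(?P=x)")    # x{Sigma*} . &x
ALPHA = re.compile(r"(?P<x>[ab]+)(?P=x)+")  # x{Sigma+} . (&x)+
GAMMA = re.compile(r"(?P<x>aa+)(?P=x)+")    # x{a a+} . (&x)+


def in_language(pattern, w):
    """Return True if the whole word w matches the compiled pattern."""
    return pattern.fullmatch(w) is not None
```

For example, `in_language(GAMMA, "a" * n)` holds exactly for composite *n*, in line with the description of \(\mathcal {L}(\gamma )\) above.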

### 2.2 Document Spanners

Let *w* := *a*_{1}*a*_{2}⋯*a*_{n} be a word over *Σ*, with \(n\in \mathbb {N}\) and *a*_{1}, …, *a*_{n} ∈ *Σ*. A *span of w* is an interval [*i*, *j*〉 with 1 ≤ *i* ≤ *j* ≤ *n* + 1 and \(i,j \in \mathbb {N}\). For each span [*i*, *j*〉 of *w*, we define a subword *w*_{[i,j〉} := *a*_{i}⋯*a*_{j−1}. In other words, each span describes a subword of *w* by its bounding indices. Two spans [*i*, *j*〉 and [*i*′, *j*′〉 of *w* are equal if and only if *i* = *i*′ and *j* = *j*′. These spans *overlap* if *i* ≤ *i*′ < *j* or *i*′ ≤ *i* < *j*′, and are *disjoint*, otherwise. The span [*i*, *j*〉 *contains* the span [*i*′, *j*′〉 if *i* ≤ *i*′ ≤ *j*′ ≤ *j*. The *set of all spans of w* is denoted by Spans(*w*).

*Example 2.5*

Let *w* := aabbcabaa. As |*w*| = 9, both [1, 3〉 and [8, 10〉 are spans of *w*, but [10, 11〉 is not. Although *w*_{[1,3〉} = *w*_{[8,10〉} = aa, the first two spans are not equal. Likewise, the two spans [3, 3〉 and [5, 5〉 are not equal, even though *w*_{[3,3〉} = *w*_{[5,5〉} = *ε*. The whole word *w* is described by the span [1, 10〉.
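The statements of this example can be checked mechanically. The sketch below encodes a span [*i*, *j*〉 as a Python pair `(i, j)` with 1-based positions; the helper names `is_span` and `subword` are assumptions for illustration.

```python
def is_span(w, i, j):
    """True if [i, j> is a span of w, i.e. 1 <= i <= j <= |w| + 1."""
    return 1 <= i <= j <= len(w) + 1


def subword(w, i, j):
    """Return the subword w_[i,j> = a_i ... a_(j-1) for a span [i, j> of w."""
    if not is_span(w, i, j):
        raise ValueError(f"[{i}, {j}> is not a span of a word of length {len(w)}")
    # Shift from 1-based span positions to Python's 0-based slicing.
    return w[i - 1 : j - 1]
```

Note that equality of spans is equality of the pairs, not of the subwords they describe: `(1, 3)` and `(8, 10)` differ even though both select aa.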

**Definition 2.6**

Let SVars be a fixed, infinite set of *span variables*, where *Σ* and SVars are disjoint. Let *V* ⊂ SVars be a finite subset of SVars, and let *w* ∈ *Σ*^{∗}. A (*V*, *w*)*-tuple* is a function *μ*: *V* → Spans(*w*) that maps each variable in *V* to a span of *w*. If context allows, we write *w*-tuple instead of (*V*, *w*)-tuple. A set of (*V*, *w*)-tuples is called a (*V*, *w*)*-relation*.

As *V* and Spans(*w*) are finite, every (*V*, *w*)-relation is finite by definition. Our next step is the definition of document spanners, which map words *w* to (*V*, *w*)-relations:

**Definition 2.7**

Let *V* and *Σ* be alphabets of variables and symbols, respectively. A *(document) spanner* is a function *P* that maps every word *w* ∈ *Σ*^{∗} to a (*V*, *w*)-relation *P*(*w*). Let *V* be denoted by SVars(*P*). A spanner *P* is *n-ary* if |SVars(*P*)| = *n*, and *Boolean* if SVars(*P*) = *∅*. For all *w* ∈ *Σ*^{∗}, we say *P*(*w*) = True and *P*(*w*) = False instead of *P*(*w*) = {()} and *P*(*w*) = *∅*, respectively.

A *w*-tuple *μ* ∈ *P*(*w*) is *hierarchical* if for all *x*, *y* ∈ SVars(*P*) at least one of the following holds: (1) The span *μ*(*x*) contains *μ*(*y*), (2) the span *μ*(*y*) contains *μ*(*x*), or (3) the spans *μ*(*x*) and *μ*(*y*) are disjoint. A spanner *P* is *hierarchical* if, for every *w* ∈ *Σ*^{∗}, every *μ* ∈ *P*(*w*) is hierarchical.
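As a minimal sketch of this condition, the following Python code checks whether a single tuple is hierarchical. A tuple *μ* is assumed to be given as a dictionary from variable names to spans, with a span [*i*, *j*〉 encoded as the pair `(i, j)`; the function names are illustrative. The predicates `overlap` and `contains` follow the definitions given for spans earlier in this section.

```python
from itertools import combinations


def overlap(s, t):
    """Spans [i, j> and [i2, j2> overlap if i <= i2 < j or i2 <= i < j2."""
    (i, j), (i2, j2) = s, t
    return i <= i2 < j or i2 <= i < j2


def contains(s, t):
    """Span [i, j> contains [i2, j2> if i <= i2 <= j2 <= j."""
    (i, j), (i2, j2) = s, t
    return i <= i2 <= j2 <= j


def is_hierarchical(mu):
    """Check that every pair of spans in mu is nested or disjoint."""
    for x, y in combinations(mu, 2):
        s, t = mu[x], mu[y]
        if not (contains(s, t) or contains(t, s) or not overlap(s, t)):
            return False
    return True
```

A spanner is then hierarchical if this check succeeds for every tuple it produces on every input word.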

A spanner *P* is *total on w* if *P*(*w*) contains all *w*-tuples over SVars(*P*). Let *Y* ⊂ SVars be a finite set of variables. The *universal spanner over Y* is denoted by Υ_{Y}. It is the unique spanner *P*′ such that SVars(*P*′) = *Y* and *P*′ is total on every *w* ∈ *Σ*^{∗}. Furthermore, a spanner *P* is *hierarchical total on w* if *P*(*w*) is exactly the set of all hierarchical *w*-tuples over SVars(*P*); and the *universal hierarchical spanner* over a set *Y* is the unique spanner \({\Upsilon }^{\mathbf {H}}_{Y}\) that is hierarchical total on every *w* ∈ *Σ*^{∗}.

For two spanners *P*_{1} and *P*_{2}, we write *P*_{1} ⊆ *P*_{2} if *P*_{1}(*w*) ⊆ *P*_{2}(*w*) for every *w* ∈ *Σ*^{∗}, and *P*_{1} = *P*_{2} if *P*_{1}(*w*) = *P*_{2}(*w*) for every *w* ∈ *Σ*^{∗}.

Hence, a spanner can be understood as a function that maps a word *w* to a set of functions, each of which assigns spans of *w* to the variables of the spanner. As Boolean spanners are functions that map words to truth values, they can be interpreted as characteristic functions of languages. For every Boolean spanner *P*, we define the *language recognized by P* as \(\mathcal {L}(P):=\{w\in \Sigma ^{*}\mid P(w)=\texttt {True}\}\). We extend this to arbitrary spanners *P* by \(\mathcal {L}(P):=\{w\in \Sigma ^{*}\mid P(w)\neq \emptyset \}\).

**Definition 2.8**

A *regex formula* is an xregex *α* over *Σ* and *X* := SVars such that *α* does not contain any variable references, and for every *β* ∈ Sub(*α*) with *β* = *γ*^{∗}, no subexpression of *γ* may be a variable binding.

In other words, a regex formula is a proper regular expression that is extended with variable binding operators, but these operators may not occur inside a Kleene star. We define SVars(*γ*) := Vars(*γ*) for all regex formulas *γ*.

To define the semantics of regex formulas, we use the definition of parse trees for xregexes, see Definition 2.2. Intuitively, the goal of this definition is that each occurrence of a variable *x* in a *γ*-parse tree is matched to the corresponding span. Here, two problems can arise. Firstly, a variable might not occur in the parse tree; for example, when matching the regex formula (*x*{a} ∨ bb) to the word bb. Secondly, a variable might be defined too often, as e. g. in the regex formula *x*{*Σ*^{+}} ⋅ *x*{*Σ*^{+}}. In order to avoid such problems, we introduce the notion of a functional regex formula.

**Definition 2.9**

Let *γ* be a regex formula. We call *γ functional* if for every *w* ∈ *Σ*^{∗} and every *γ*-parse tree *T*_{γ} for *w*, for each variable *x* ∈ SVars(*γ*), exactly one node of *T*_{γ} has a label of the form (*v*, *x*{*β*}), where *v* is a subword of *w* and *β* is a sub-regex formula of *γ*. The class of all functional regex formulas is denoted by RGX.

As shown in Proposition 3.5 in Fagin et al. [7], functionality has a straightforward syntactic characterization: Basically, variables may not be redeclared, variables may not be used inside of Kleene stars, and if variables are used in a disjunction, each side of a disjunction has to bind exactly the same variables. Consider the following example:

*Example 2.10*

The regex formula *γ*_{1} := (*x*{a} ∨ *x*{b}) is functional even though it contains two occurrences of variable definitions for *x*. There are just two *γ*_{1}-parse trees, both of which only contain one node labeled (*c*, *x*{*c*}), where *c* ∈ {a, b}. As a trivial case, even *γ*_{2} := *x*{*∅*} is functional (as no *γ*_{2}-parse tree exists). Furthermore, the regex formulas *γ*_{3} := *x*{(a ∨ b)^{∗}} ⋅ *x*{b^{+}} and *γ*_{4} := a^{∗} ∨ *x*{b} are not functional. Finally, *γ*_{5} := *x*{a}^{∗} is not a regex formula at all.

For functional regex formulas, we use parse trees to define the semantics:

**Definition 2.11**

Let *γ* be a functional regex formula and let *T* be a *γ*-parse tree for a word *w* ∈ *Σ*^{∗}. For every node *v* of *T*, the subtree that is rooted at *v* naturally maps to a span *p*(*v*) of *w*. As *γ* is functional, for every *x* ∈ SVars(*γ*), exactly one node *v*_{x} of *T* has a label that contains *x*. We define *μ*^{T}: SVars(*γ*) → Spans(*w*) by *μ*^{T}(*x*) := *p*(*v*_{x}). Each *γ* ∈ RGX defines a spanner [[*γ*]] by [[*γ*]](*w*) := {*μ*^{T} ∣ *T* is a *γ*-parse tree for *w*} for each *w* ∈ *Σ*^{∗}.

*Example 2.12*

Let a, b ∈ *Σ*. We define the regex formula *α* := […] and let *w* := baaba. Then [[*α*]](*w*) consists of ([2, 4〉, [3, 3〉, [3, 4〉), ([2, 5〉, [3, 4〉, [4, 5〉), ([2, 6〉, [3, 5〉, [5, 6〉), ([3, 5〉, [4, 4〉, [4, 5〉), and ([3, 6〉, [4, 5〉, [5, 6〉).

For every *w* ∈ *Σ*^{∗}, a spanner *P* defines a (*V*, *w*)-relation *P*(*w*). In order to construct more sophisticated spanners, we introduce spanner operators.

**Definition 2.13**

Let *P*, *P*_{1}, *P*_{2} be spanners and let *w* ∈ *Σ*^{∗}. The algebraic operators *union, projection, natural join and selection* are defined as follows.

- Union: Two spanners *P*_{1} and *P*_{2} are *union compatible* if SVars(*P*_{1}) = SVars(*P*_{2}), and their *union* (*P*_{1} ∪ *P*_{2}) is defined by SVars(*P*_{1} ∪ *P*_{2}) := SVars(*P*_{1}) = SVars(*P*_{2}) and (*P*_{1} ∪ *P*_{2})(*w*) := *P*_{1}(*w*) ∪ *P*_{2}(*w*) for every *w* ∈ *Σ*^{∗}.
- Projection: Let *Y* ⊆ SVars(*P*). The *projection* *π*_{Y}*P* is defined by SVars(*π*_{Y}*P*) := *Y* and *π*_{Y}*P*(*w*) := *P*(*w*)|_{Y} for all *w* ∈ *Σ*^{∗}, where *P*(*w*)|_{Y} is the restriction of all *w*-tuples in *P*(*w*) to *Y*.
- Natural join: Let *V*_{i} := SVars(*P*_{i}) for *i* ∈ {1, 2}. The *(natural) join* (*P*_{1} ⋈ *P*_{2}) of *P*_{1} and *P*_{2} is defined by SVars(*P*_{1} ⋈ *P*_{2}) := SVars(*P*_{1}) ∪ SVars(*P*_{2}) and, for all *w* ∈ *Σ*^{∗}, we define (*P*_{1} ⋈ *P*_{2})(*w*) as the set of all (*V*_{1} ∪ *V*_{2}, *w*)-tuples *μ* for which there exist tuples *μ*_{1} ∈ *P*_{1}(*w*) and *μ*_{2} ∈ *P*_{2}(*w*) with \(\mu |_{V_{1}} = \mu _{1}\) and \(\mu |_{V_{2}} = \mu _{2}\).
- Selection: Let *R* ⊆ (*Σ*^{∗})^{k} be a *k*-ary relation over *Σ*^{∗}. The *selection operator* *ζ*^{R} is parameterized by *k* variables *x*_{1}, …, *x*_{k} ∈ SVars(*P*), written as \(\zeta ^{R}_{x_{1},\dots ,x_{k}}\). The *selection* \(\zeta ^{R}_{x_{1},\dots ,x_{k}} P\) is defined by \(\textsf{SVars}(\zeta ^{R}_{x_{1},\dots ,x_{k}} P) := {\textsf{SVars}\left (P\right )}\) and, for all *w* ∈ *Σ*^{∗}, we define \(\zeta ^{R}_{x_{1},\dots ,x_{k}} P(w)\) as the set of all *μ* ∈ *P*(*w*) for which \(\left (w_{\mu (x_{1})}, \dots , w_{\mu (x_{k})}\right ) \in R\).

Like [7], we mostly consider the *string equality selection* operator *ζ*^{=}. Hence, unless otherwise noted, the term “selection” refers to selection by the *n*-ary string equality relation. Note that unlike selection (which compares strings), join requires that the spans are identical.

The join *P*_{1} ⋈ *P*_{2} of two spanners *P*_{1} and *P*_{2} is equivalent to the intersection *P*_{1} ∩ *P*_{2} if SVars(*P*_{1}) = SVars(*P*_{2}), and to the Cartesian Product *P*_{1} × *P*_{2} if SVars(*P*_{1}) and SVars(*P*_{2}) are disjoint. Hence, if applicable, we write ∩ and × instead of ⋈.

For convenience, we may add and omit parentheses. We assume there is an order of precedence with projection and selection ranking over join ranking over union, e.g. we may write \(\pi _{Y} \zeta ^=_{x,y} P_{1} \cup P_{2} \bowtie P_{3}\) instead of \((\pi _{Y} \zeta ^=_{x,y} P_{1} \cup (P_{2} \bowtie P_{3}))\), where projection and selection are applied to *P*_{1}, and the result is united with the join of *P*_{2} and *P*_{3}.

*Example 2.14*

Let \(P_{1}:= \zeta ^=_{x,y} {\left [{\kern -2.3pt}[ x\{\Sigma ^{*}\} y\{\Sigma ^{*}\} \right ]{\kern -2.3pt}]}\) and \(P_{2}:= \zeta ^=_{x,y,z}{\left [{\kern -2.3pt}[x\{\Sigma ^{*}\} y\{\Sigma ^{*}\} z\{\Sigma ^{*}\} \right ]{\kern -2.3pt}]}\). Then \(\mathcal {L}(P_{1})=\{ww\mid w\in \Sigma ^{*}\}\), and the variables *x* and *y* refer to the span of the first and second occurrence of *w*, respectively. Analogously, \(\mathcal {L}(P_{2})=\{w^{3}\mid w\in \Sigma ^{*}\}\) (and *z* refers to the third occurrence of *w*). Assume that we want to construct a spanner for the language {*w*^{n}∣*w* ∈ *Σ*^{∗}, *n* ∈ {2, 3}}. As *P*_{1} and *P*_{2} are not union compatible, we cannot simply define *P*_{1} ∪ *P*_{2}. Union compatibility can be achieved by projecting *P*_{2} onto the set of common variables (i. e., *π*_{{x,y}}*P*_{2}).
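The operators can be illustrated on concrete relations. The Python sketch below represents a *w*-tuple as a `frozenset` of `(variable, span)` pairs and a relation as a set of such tuples; it implements string equality selection, projection, and join on relations, and evaluates the spanner *P*_{1} ∪ *π*_{{x,y}}*P*_{2} from this example. All function names are assumptions, `split_spanner` covers only regex formulas of the shape *x*_{1}{*Σ*^{∗}}⋯*x*_{k}{*Σ*^{∗}} with *k* ≥ 1, and union compatibility is not enforced.

```python
from itertools import combinations_with_replacement


def subword(w, span):
    """Subword of w described by the 1-based span (i, j) = [i, j>."""
    i, j = span
    return w[i - 1 : j - 1]


def split_spanner(variables, w):
    """Relation of [[x1{Σ*} ... xk{Σ*}]] on w: every way to cut w into
    k consecutive, possibly empty, factors (one factor per variable)."""
    n, k = len(w), len(variables)
    relation = set()
    for cuts in combinations_with_replacement(range(1, n + 2), k - 1):
        bounds = (1, *cuts, n + 1)
        relation.add(frozenset(
            (v, (bounds[t], bounds[t + 1])) for t, v in enumerate(variables)
        ))
    return relation


def select_eq(relation, w, variables):
    """String equality selection: keep tuples whose spans for the given
    variables all describe the same subword of w."""
    kept = set()
    for mu in relation:
        spans = dict(mu)
        if len({subword(w, spans[v]) for v in variables}) == 1:
            kept.add(mu)
    return kept


def project(relation, keep):
    """Projection: restrict every tuple to the variables in keep."""
    return {frozenset((v, s) for v, s in mu if v in keep) for mu in relation}


def join(r1, r2):
    """Natural join: merge tuples that agree (as spans) on shared variables."""
    out = set()
    for m1 in r1:
        for m2 in r2:
            d1, d2 = dict(m1), dict(m2)
            if all(d1[v] == d2[v] for v in d1.keys() & d2.keys()):
                out.add(m1 | m2)
    return out


def example_2_14(w):
    """Evaluate P1 ∪ π_{x,y} P2 on w; union is plain set union here."""
    p1 = select_eq(split_spanner(["x", "y"], w), w, ["x", "y"])
    p2 = select_eq(split_spanner(["x", "y", "z"], w), w, ["x", "y", "z"])
    return p1 | project(p2, {"x", "y"})
```

On aaaaaa, the result contains one tuple from *P*_{1} (two halves of length three) and one from the projected *P*_{2} (the first two of three blocks of length two). The `join` function compares spans rather than subwords, which reflects the difference between join and selection.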

**Definition 2.15**

A *spanner algebra* is a finite set of spanner operators. If O is a spanner algebra, then RGX^{O} denotes the set of all *spanner representations* that can be constructed by (repeated) combination of the symbols for the operators from O with regex formulas from RGX. For each operator *o* ∈ O and each spanner representation of the form *o**ρ* (if *o* is unary) or *ρ*_{1}*o**ρ*_{2} (if *o* is binary), we define [[*o**ρ*]] := *o*[[*ρ*]] or [[*ρ*_{1}*o**ρ*_{2}]] := [[*ρ*_{1}]] *o* [[*ρ*_{2}]], respectively. Furthermore, [[RGX^{O}]] is the closure of [[RGX]] under the spanner operators in O.

We define \(\mathcal {L}(\rho ):=\mathcal {L}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\) for every spanner representation *ρ*. Fagin et al. [7] refer to [[RGX]] as the class of *hierarchical regular spanners* and to [[RGX^{{π, ∪, ⋈}}]] as the class of *regular spanners*. In addition to (hierarchical) regular spanners, Fagin et al. also introduced the so-called *core spanners*, which are obtained by combining regex formulas with the four algebraic operators projection, selection, union, and join – in other words, the class of core spanners is the class \([{\kern -2.3pt}[\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}]{\kern -2.3pt}]\). Analogously, \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) is the class of *core spanner representations*.

## 3 Expressibility Results

### 3.1 Pattern Languages

We begin our examination of the expressive power of core spanners by comparing them to one of the simplest mechanisms with repetition operators:

**Definition 3.1**

Let *X* be an infinite variable alphabet that is disjoint from *Σ*. A *pattern* is a word *α* ∈ (*Σ* ∪ *X*)^{+} that generates the language \(\mathcal {L}(\alpha ):=\{\sigma (\alpha )\mid \sigma \text { is a pattern substitution}\}\), where a *pattern substitution* is a homomorphism *σ*: (*Σ* ∪ *X*)^{∗} → *Σ*^{∗} with *σ*(*a*) = *a* for all *a* ∈ *Σ*. We denote the set of all variables in *α* by Vars(*α*).

Intuitively, a pattern *α* generates exactly those words that can be obtained by replacing the variables in *α* with terminal words homomorphically (i. e., multiple occurrences of the same variable have to be replaced in the same way). This type of pattern languages is also called *erasing pattern language* (cf. Jiang et al. [24]).

*Example 3.2*

Let *x*, *y* ∈ *X* and a, b ∈ *Σ*. The patterns *α* := *x**x* and *β* := *x*a*y*b*x* generate the languages \(\mathcal {L}(\alpha )=\{ww\mid w\in \Sigma ^{*}\}\) and \(\mathcal {L}(\beta )=\{v \mathtt {a} w \mathtt {b} v\mid v,w\in \Sigma ^{*}\}.\)
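Membership in an erasing pattern language can be tested by searching for a suitable substitution. The following backtracking sketch assumes that a pattern is given as a Python string in which every character is one symbol, together with the set of characters that act as variables; the function name `matches` is an assumption. The search takes exponential time in the worst case, which is consistent with the NP-completeness of the membership problem mentioned in the introduction.

```python
def matches(pattern, w, variables, sigma=None):
    """Decide whether w is in the erasing pattern language of pattern.

    pattern:   string over terminals and variables (one char per symbol)
    variables: set of characters that are treated as variables
    sigma:     partial substitution built so far (variable -> word)
    """
    sigma = {} if sigma is None else sigma
    if not pattern:
        return w == ""
    head, rest = pattern[0], pattern[1:]
    if head not in variables:
        # Terminal symbol: it must be the next letter of w.
        return w.startswith(head) and matches(rest, w[1:], variables, sigma)
    if head in sigma:
        # Repeated variable: it must be replaced by the same word again.
        image = sigma[head]
        return w.startswith(image) and matches(
            rest, w[len(image):], variables, sigma
        )
    # First occurrence: try every prefix of w, including the empty word.
    return any(
        matches(rest, w[k:], variables, {**sigma, head: w[:k]})
        for k in range(len(w) + 1)
    )
```

For instance, `matches("xx", "abab", {"x"})` succeeds with *σ*(*x*) = ab, while `matches("xx", "aba", {"x"})` fails.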

From every pattern *α*, we can straightforwardly construct an xregex for \(\mathcal {L}(\alpha )\). A similar observation holds for core spanners:

**Theorem 3.3**

*There is an algorithm that, given a pattern α, computes in polynomial time*\(\rho _{\alpha }\in \textsf{RGX}^{\{\zeta ^=\}}\)*such that*\(\mathcal {L}(\rho _{\alpha })=\mathcal {L}(\alpha )\).

### Proof

Let *α* = *α*_{1}⋯*α*_{n} with \(n\in \mathbb {N}_{>0}\) and *α*_{1}, …, *α*_{n} ∈ (*Σ* ∪ *X*). We rewrite *α* into a regex formula \(\hat {\alpha }\) by replacing the *i*-th occurrence of a variable *x* with a binding *x*_{i}{*Σ*^{∗}}. More formally, we define \(\hat {\alpha }:= \hat {\alpha }_{1}{\cdots } \hat {\alpha }_{n}\), where for each *i* ∈ {1, …, *n*}, the regex formula \(\hat {\alpha }_{i}\) is defined as follows:

1. If *α*_{i} is a terminal (i. e., there is an *a* ∈ *Σ* with *α*_{i} = *a*), let \(\hat {\alpha }_{i}:= a\).
2. If *α*_{i} is the *j*-th occurrence of a variable *x* ∈ *X* in *α*, let \(\hat {\alpha }_{i}:= x_{j}\{\Sigma ^{*}\}\).

We now define *S* to be a sequence of selections; where *S* contains exactly the selections \(\zeta ^=_{x_{1},\ldots ,x_{k}}\) for each *x* ∈ Vars(*α*) with |*α*|_{x} = *k* and *k* ≥ 2. In other words, for each *x* that occurs more than once in *α*, we include a selection of all *x*_{i}.

Finally, we define \(\rho _{\alpha }:= S \hat {\alpha }\). It is easy to see that \(\mathcal {L}(\rho _{\alpha })=\mathcal {L}(\alpha )\): For every \(w\in \mathcal {L}(\alpha )\), we can use a pattern substitution *σ* with *σ*(*α*) = *w* to construct a corresponding *w*-tuple *μ* for *ρ*_{α}. Likewise, for every \(w\in \mathcal {L}(\rho _{\alpha })\), there exists a corresponding *w*-tuple *μ* from which we can reconstruct a pattern substitution *σ* with *σ*(*α*) = *w*: By the construction of *ρ*_{α}, for each pair of variables *x*_{i}, *x*_{j} in \(\hat {\alpha }\), the words \(w_{\mu (x_{i})}\) and \(w_{\mu (x_{j})}\) must be identical. This allows us to define \(\sigma (x):= w_{\mu (x_{1})}\). □

*Example 3.4*

Let *x*, *y*, *z* ∈ *X*, a, b ∈ *Σ*, and define the pattern *α* := *x*a*y**y*b*x**z**x*. The construction in the proof of Theorem 3.3 leads to the spanner representation \(\zeta ^=_{x_{1},x_{2},x_{3}}\zeta ^=_{y_{1},y_{2}} \gamma \), where *γ* = *x*_{1}{*Σ*^{∗}}⋅a⋅*y*_{1}{*Σ*^{∗}}⋅*y*_{2}{*Σ*^{∗}}⋅b⋅*x*_{2}{*Σ*^{∗}}⋅*z*_{1}{*Σ*^{∗}}⋅*x*_{3}{*Σ*^{∗}}.
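The rewriting step of Theorem 3.3 is simple enough to sketch in a few lines of Python. The code below assumes single-character symbols and returns the construction as plain strings: a list of selections (each a list of indexed variable names) and the regex formula \(\hat {\alpha }\). The function name `pattern_to_spanner` and this textual output format are illustrative assumptions.

```python
from collections import Counter


def pattern_to_spanner(pattern, variables):
    """Build the selections S and the regex formula for a pattern.

    Each j-th occurrence of a variable x becomes the binding xj{Σ*};
    terminals are copied unchanged. A string equality selection over
    x1, ..., xk is added for every variable x with k >= 2 occurrences.
    """
    seen = Counter()
    parts = []
    for symbol in pattern:
        if symbol in variables:
            seen[symbol] += 1
            parts.append(f"{symbol}{seen[symbol]}{{Σ*}}")
        else:
            parts.append(symbol)
    regex_formula = "⋅".join(parts)
    selections = [
        [f"{x}{i}" for i in range(1, k + 1)]
        for x, k in seen.items()
        if k >= 2
    ]
    return selections, regex_formula
```

Applied to the pattern of this example, `pattern_to_spanner("xayybxzx", {"x", "y", "z"})` yields the selections over *x*_{1}, *x*_{2}, *x*_{3} and over *y*_{1}, *y*_{2}, together with the formula *γ* given above. The single pass over the pattern also makes the polynomial running time of the construction evident.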

While the construction in the proof of Theorem 3.3 is so simple that it might not seem noteworthy, it will prove quite useful: In contrast to the simple definition of pattern languages, many canonical decision problems for them are surprisingly hard. Via Theorem 3.3, the corresponding lower bounds also apply to spanners, as we discuss in Sections 4.1 and 4.2.

### 3.2 Word Equations and Existential Concatenation Formulas

In this section, we introduce word equations, which are equations of patterns (cf. Definition 3.1) and can be used to define languages and relations, cf. Karhumäki et al. [26]:

**Definition 3.5**

A *word equation* is a pair *η* := (*η*_{L}, *η*_{R}) of patterns *η*_{L} and *η*_{R}. A pattern substitution *σ* is a *solution* of *η* if *σ* (*η*_{L}) = *σ* (*η*_{R}). We define Vars(*η*) := Vars(*η*_{L}) ∪ Vars(*η*_{R}). For *k* ≥ 1, a relation *R* ⊆ (*Σ*^{∗})^{k} is defined by a word equation *η* := (*η*_{L}, *η*_{R}) if there exist variables *x*_{1}, …, *x*_{k} ∈ Vars(*η*) such that \(R=\left \{\left (\sigma (x_{1}),\ldots ,\sigma (x_{k})\right )\mid \text {\(\sigma \) is a solution of \(\eta \)}\right \}.\)

We also write (*η*_{L}, *η*_{R}) as *η*_{L} = *η*_{R}. As we shall see just after the next definition, both sides of the equation may have common variables. The following relations are well-known examples of relations that are definable by word equations:

**Definition 3.6**

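Over *Σ*^{∗}, we define the relations

$$R_{\text{com}} := \{(x,y)\mid \text{there is a } u\in\Sigma^{*} \text{ with } x,y\in\{u\}^{*}\}, \qquad R_{\text{cyc}} := \{(x,y)\mid x \text{ is a cyclic permutation of } y\}.$$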

As shown in Lothaire [30], the relation *R*_{com} is defined by the equation *x**y* = *y**x*, and *R*_{cyc} is defined by the equation *x**z* = *z**y*.
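As a sanity check, the two equations can be compared against the combinatorial characterizations that are used in the proof of Proposition 3.7 (*x*, *y* ∈ {*u*}^{∗} for commutation; *x* = *u**v* and *y* = *v**u* for cyclic permutation). The following Python sketch does this by brute force over all short words. The helper names and the length bound are assumptions chosen for illustration, and the check is only an experiment on small inputs, not a proof.

```python
from itertools import product


def words(alphabet, max_len):
    """Yield all words over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def is_power_of(w, u):
    """True iff w is in {u}*."""
    if u == "":
        return w == ""
    return len(w) % len(u) == 0 and u * (len(w) // len(u)) == w


def in_r_com(x, y):
    """True iff x and y are powers of a common word u (any such u is a prefix of x + y)."""
    xy = x + y
    return any(
        is_power_of(x, xy[:k]) and is_power_of(y, xy[:k])
        for k in range(len(xy) + 1)
    )


def in_r_cyc(x, y):
    """True iff x = uv and y = vu for some u, v."""
    return any(x[i:] + x[:i] == y for i in range(len(x) + 1))


def characterizations_agree(alphabet="ab", max_len=3):
    """Compare both relations with the equations xy = yx and xz = zy."""
    ws = list(words(alphabet, max_len))
    for x in ws:
        for y in ws:
            if in_r_com(x, y) != (x + y == y + x):
                return False
            # a witness z for xz = zy can be chosen with |z| <= |x|
            if in_r_cyc(x, y) != any(x + z == z + y for z in ws):
                return False
    return True


print(characterizations_agree())
```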

Let *R* be a *k*-ary string relation, and let *C* be a class of spanners. We say that *R* is *selectable* by *C*, if for every spanner *P* ∈ *C* and every sequence of variables **x** = (*x*_{1}, …, *x*_{k}) with *x*_{1}, …, *x*_{k} ∈ SVars(*P*), the spanner \(\zeta ^{R}_{\mathbf {x}} P\) is also in *C*.

**Proposition 3.7**

*The relations* *R*_{com} *and* *R*_{cyc} *are selectable by core spanners.*

### Proof

Both parts of the proof use a technique from [7]. Let **x** = *x*_{1}, …, *x*_{k} be a sequence of distinct span variables (*k* ≥ 1), and let *X* := {*x*_{1}, …, *x*_{k}}. The spanner \(\zeta ^{R}_{\mathbf {x}} {\Upsilon }_{X}\) is called the *R-restricted universal spanner* over **x**, and is denoted by \({\Upsilon }^{R}_{\mathbf {x}}\). According to Proposition 4.15 in [7], in order to show that a relation *R* is selectable by core spanners, it suffices to show that \({\Upsilon }^{R}_{\mathbf {x}}\) is a core spanner for every **x** ∈ SVars^{k}.

*R*_{cyc}: Note that for all *x*, *y* ∈ *Σ*^{∗}, the word *x* is a cyclic permutation of *y* (and vice versa) if and only if there exist *u*, *v* ∈ *Σ*^{∗} with *x* = *u**v* and *y* = *v**u* (see e. g. Lothaire [30]). Hence we can define the core spanner \(P_{\text {cyc}}:= \pi _{\{x,y\}} \hat {P}\), where

$$\hat{P} := \zeta^=_{u_{1},u_{2}}\zeta^=_{v_{1},v_{2}}(\alpha_{x}\times\alpha_{y}),$$

and *α*_{x} and *α*_{y} are defined as

$$\alpha_{x} := \Sigma^{*}\cdot x\{u_{1}\{\Sigma^{*}\}\cdot v_{1}\{\Sigma^{*}\}\}\cdot\Sigma^{*}, \qquad \alpha_{y} := \Sigma^{*}\cdot y\{v_{2}\{\Sigma^{*}\}\cdot u_{2}\{\Sigma^{*}\}\}\cdot\Sigma^{*}.$$

For every *w* ∈ *Σ*^{∗} and every *μ* ∈ *P*_{cyc}(*w*), there exists a \(\hat {\mu }\in \hat {P}(w)\) with \(\mu (x)=\hat {\mu }(x)\) and \(\mu (y)=\hat {\mu }(y)\). The selections enforce \(u:= w_{\hat {\mu }(u_{1})}=w_{\hat {\mu }(u_{2})}\) and \(v:= w_{\hat {\mu }(v_{1})}=w_{\hat {\mu }(v_{2})}\). Hence, *w*_{μ(x)} = *u**v* and *w*_{μ(y)} = *v**u*, which means that (*w*_{μ(x)}, *w*_{μ(y)}) ∈ *R*_{cyc}, and \(\mu \in {\Upsilon }^{R_{\text {cyc}}}_{x,y}(w)\). For the other direction, we can show analogously that every \(\mu \in {\Upsilon }^{R_{\text {cyc}}}_{x,y}(w)\) can be extended into a \(\hat {\mu }\in \hat {P}(w)\), which then proves *μ* ∈ *P*_{cyc}(*w*).

*R*_{com}: This proof relies on another fact from combinatorics on words. For all *x*, *y* ∈ *Σ*^{∗}, the equation *x**y* = *y**x* holds if and only if (*x*, *y*) ∈ *R*_{com} (again, see Lothaire [30]). We define a core spanner \(P_{\text {com}}:= \pi _{\{x,y\}}\hat {P}\), where

$$\hat{P} := \zeta^=_{r_{1},r_{2},r_{3},r_{4}}\zeta^=_{x,x_{2}}\zeta^=_{\hat{x},\hat{x}_{2}}\zeta^=_{y,y_{2}}\zeta^=_{\hat{y},\hat{y}_{2}}(\alpha_{1}\times\alpha_{2}\times\alpha_{3}\times\alpha_{4}),$$

and *α*_{1}, …, *α*_{4} are defined as

$$\begin{aligned} \alpha_{1} &:= \Sigma^{*}\cdot x\{\hat{x}\{\Sigma^{*}\}\cdot r_{1}\{\Sigma^{*}\}\}\cdot\Sigma^{*}, & \alpha_{2} &:= \Sigma^{*}\cdot x_{2}\{r_{2}\{\Sigma^{*}\}\cdot \hat{x}_{2}\{\Sigma^{*}\}\}\cdot\Sigma^{*},\\ \alpha_{3} &:= \Sigma^{*}\cdot y\{\hat{y}\{\Sigma^{*}\}\cdot r_{3}\{\Sigma^{*}\}\}\cdot\Sigma^{*}, & \alpha_{4} &:= \Sigma^{*}\cdot y_{2}\{r_{4}\{\Sigma^{*}\}\cdot \hat{y}_{2}\{\Sigma^{*}\}\}\cdot\Sigma^{*}. \end{aligned}$$

Let *μ* ∈ *P*_{com}(*w*) for some *w* ∈ *Σ*^{∗}. Again, this means that there exists a \(\hat {\mu }\in \hat {P}(w)\) with \(\mu (x)=\hat {\mu }(x)\) and \(\mu (y)=\hat {\mu }(y)\). In a slight abuse of notation, we identify the variables \(x,\hat {x},y,\hat {y}\) with the corresponding subwords of *w*. In other words, we define \(x,\hat {x},y,\hat {y}\in \Sigma ^{*}\) by \(z := w_{\hat {\mu }(z)}\) for \(z\in \{x,\hat {x},y,\hat {y}\}\). Furthermore, let \(r=w_{\hat {\mu }(r_{1})}\). Due to the equality selections, we obtain the following word equations from *α*_{1} to *α*_{4}:

$$x=\hat{x}r=r\hat{x}, \qquad y=\hat{y}r=r\hat{y}.$$

By the structure of *α*_{1}, we know that \(w_{\mu (x)} = w_{\mu (\hat {x})}\cdot w_{\mu (r_{1})}\) holds. Likewise, the structure of *α*_{2} ensures that \(w_{\mu (x_{2})} = w_{\mu (r_{2})}\cdot w_{\mu (\hat {x}_{2})}.\) Due to the selections \(\zeta ^=_{r_{1},r_{2},r_{3},r_{4}}\), \(\zeta ^=_{x,x_{2}}\), and \(\zeta ^=_{\hat {x},\hat {x}_{2}}\), the latter can be expressed as \(w_{\mu (x)} = w_{\mu (r_{1})}\cdot w_{\mu (\hat {x})},\) and by combining the two equations while abusing the notation as explained above, we obtain \(x=\hat {x}r=r\hat {x}\). The second equation is obtained analogously.

As \(\hat {x}r = r \hat {x}\), there exists a word *u* ∈ *Σ*^{∗} with \(r,\hat {x}\in \{u\}^{*}\). We choose the shortest *u* for which *r* ∈ {*u*}^{∗}. Then, due to \(\hat {y} r = r \hat {y}\), we have that \(\hat {y}\in \{u\}^{*}\) holds as well. This implies *x*, *y* ∈ {*u*}^{∗}, (*w*_{μ(x)}, *w*_{μ(y)}) ∈ *R*_{com}, and \(\mu \in {\Upsilon }^{R_{\text {com}}}_{x,y}(w)\). Again we can show analogously that every \(\mu \in {\Upsilon }^{R_{\text {com}}}_{x,y}(w)\) can be extended into a \(\hat {\mu }\in \hat {P}(w)\), which then proves *μ* ∈ *P*_{com}(*w*). □

In particular, this means that we can add \(\zeta ^{R_{\text {com}}}\) and \(\zeta ^{R_{\text {cyc}}}\) to core spanner representations, without leaving the class \([{\kern -2.3pt}[\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}]{\kern -2.3pt}]\).

### Example 3.8

Define *L*_{imp} := {*w*^{n}∣*w* ∈ Σ^{+}, *n* ≥ 2} and *ρ* := \({\zeta}^{R_{\text {com}}}_{x,y}\) (*x*{Σ^{+}} ⋅ *y*{Σ^{+}}). Then \(\mathcal {L}(\rho )=L_{\text {imp}}\).

This does not imply that *R*_{com} can be used to select relations like *R*_{pow} := {(*x*, *x*^{n})∣*n* ≥ 0}. For example, if *x* := abab, then (*x*, *y*) ∈ *R*_{com} holds for all *y* ∈ {ab}^{∗}. The authors conjecture that *R*_{pow} is not selectable by core spanners.

Furthermore, the spanner that is constructed for *R*_{com} in the proof of Proposition 3.7 is more complicated than the corresponding word equation *x**y* = *y**x*. In fact, we constructed both spanners not from the equations, but from a characterization of the solutions. This appears to be necessary, due to the fact that spanners need to relate their variables to an input *w*, while word equations use their variables without such restrictions. We shall see in Theorem 3.13 that, if this is kept in mind, core spanners can be used to simulate word equations.

Before we consider this topic further, we examine how word equations can simulate spanners, as this shall provide useful insights on some questions of static analysis in Section 4.2. One drawback of word equations is that they are unable to express many comparatively simple regular languages, like *A*^{∗} for any non-empty *A* ⊂ *Σ*^{∗} (cf. Karhumäki et al. [26]). In order to overcome this problem, we consider the following extension:

**Definition 3.9**

Let *η* := (*η*_{L}, *η*_{R}) be a word equation. A *regular constraints function*^{1} is a function \({\mathcal {C}}\) that maps each *x* ∈ Vars(*η*) to a nondeterministic finite automaton \({\mathcal {C}}(x)\). A solution *σ* of *η* is a *solution of η under constraints*\({\mathcal {C}}\) if \(\sigma (x)\in \mathcal {L}({\mathcal {C}}(x))\) holds for every *x* ∈ Vars(*η*).

Hence, regular constraints restrict the possible substitutions of a variable *x* to a regular language \(\mathcal {L}(\mathcal {C}(x))\).

We also consider the *existential theory of concatenation* (EC), which is obtained by extending word equations with ∨, ∧, and existential quantification over variables. For example, *R*_{cyc} is expressed by the EC-formula

$$\exists z\colon (x z = z y).$$

Like word equations, these formulas can be further extended by adding regular constraints. For each variable *x* and each nondeterministic finite automaton (NFA) *A*, the *(regular) constraint* *L*_{A}(*x*) is satisfied for a solution *σ* if \(\sigma (x)\in \mathcal {L}(A)\). We call the resulting class of formulas EC^{reg}, the *existential theory of concatenation with regular constraints*.

### Example 3.10

Let *A* be an NFA with \(\mathcal {L}(A)=\{\mathtt {a}\mathtt {b}^{i}\mathtt {a}\mid i\geq 1\}\), and define the EC^{reg}-formula *φ*(*x*, *y*) := ∃*z*: (*L*_{A}(*z*) ∧ (∃*z*_{1}, *z*_{2}: *x* = *z*_{1}*z**z*_{2}) ∧ (∃*z*_{1}, *z*_{2}: *y* = *z*_{1}*z**z*_{2})).

Then *φ* expresses the relation of all (*x*, *y*) that have a common subword *z* from \(\mathcal {L}(A)\).
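To make the semantics of Example 3.10 concrete, the following Python sketch evaluates *φ*(*x*, *y*) on given words by enumerating candidate values for the quantified variable *z*. The NFA *A* is replaced by the equivalent regular expression `ab+a` from Python's `re` module, and the names `subwords_in_A` and `phi` are assumptions made for this illustration.

```python
import re

# Regular expression for L(A) = { a b^i a | i >= 1 }
A = re.compile(r"ab+a")


def subwords_in_A(w):
    """All subwords z of w with z in L(A)."""
    return {
        w[i:j]
        for i in range(len(w))
        for j in range(i + 1, len(w) + 1)
        if A.fullmatch(w[i:j])
    }


def phi(x, y):
    """True iff x and y have a common subword z in L(A).

    The existential quantifiers over z1, z2 correspond to choosing
    where z occurs inside x and inside y.
    """
    return bool(subwords_in_A(x) & subwords_in_A(y))


print(phi("babab", "aabaa"))  # both contain "aba"
print(phi("abba", "aba"))     # "abba" vs "aba": no common witness
```

This naive evaluation inspects quadratically many subwords per input word; it is only meant to illustrate what the formula expresses, not how satisfiability of EC^{reg}-formulas is decided.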

Note that we intentionally use *L*_{A}(*x*) for constraint symbols instead of \({\mathcal {C}}\), to emphasize the following distinction in the use of constraints: In word equations, every variable *x* is constrained to one language \(\mathcal {L}({\mathcal {C}}(x))\). In contrast to this, an EC^{reg}-formula can use multiple constraint symbols for one variable (e. g., in the form of \(L_{A}(x)\land L_{A^{\prime }}(x)\)), or none at all.

Using the same techniques as for EC, one can transform EC^{reg}-formulas into equivalent word equations with regular constraints. Again, the construction can result in an exponential blowup, but satisfiability of EC^{reg}-formulas can still be decided in PSPACE (cf. Diekert [6]).

In order to simulate core spanners with EC^{reg}-formulas, we introduce the following definition:

**Definition 3.11**

Let *P* be a core spanner with SVars(*P*) = {*x*_{1}, …, *x*_{n}}, *n* ≥ 0, and let \(\varphi (x_{w},{x^{P}_{1}}, {x^{C}_{1}}, {\ldots } {x^{P}_{n}}, {x^{C}_{n}})\) be an EC^{reg}-formula. We say that *φ* realizes *P* if, for all \(w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}}\in \Sigma ^{*}\), we have that \(\varphi (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathtt {True}\) holds if and only if there is a *μ* ∈ *P*(*w*) with \({w^{P}_{k}} = w_{[1,i_{k}\rangle }\) and \({w^{C}_{k}} = w_{[i_{k},j_{k}\rangle }\) for each 1 ≤ *k* ≤ *n*, where [*i*_{k}, *j*_{k}〉 = *μ*(*x*_{k}).

This definition uses the fact that spans are always defined in relation to a word *w*. Note that every span [*i*, *j*〉 ∈ Spans(*w*) is characterized by the words *w*_{[1,i〉} and *w*_{[i,j〉}. Hence, if *μ* ∈ [[*ρ*]](*w*), the EC^{reg}-formula models *μ*(*x*_{k}) = [*i*_{k},*j*_{k}〉 by mapping *x*_{w} to *w*, \({x^{P}_{k}}\) to \(w_{[1,i_{k}\rangle }\), and \({x^{C}_{k}}\) to \(w_{[i_{k},j_{k}\rangle }\). In the naming of the variables, *C* stands for *content*, and *P* for *prefix*. This allows us to model spanners in EC^{reg}-formulas:

**Theorem 3.12**

*There is an algorithm that, given* \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), *computes in polynomial time an* EC^{reg}*-formula* *φ*_{ρ} *that realizes* [[*ρ*]].

### Proof

Before presenting the construction that is the main part of the proof, we briefly consider a technical detail of functional regex formulas. On an intuitive level, functional regex formulas guarantee that in each parse tree, every variable is assigned exactly once (hence, *x*{a} ⋅ *x*{a} is not functional). Consequently, it seems reasonable to conjecture that, if a functional regex formula contains a subformula of the form *α*_{1} ⋅ *α*_{2}, then SVars(*α*_{1}) ∩ SVars(*α*_{2}) = *∅* must hold.

While this conjecture is true for regex formulas that do not contain *∅*, it does not hold in general. For example, consider *α* := *α*_{1} ⋅ *α*_{2} with *α*_{1} := *x*{a} and *α*_{2} := (*x*{*∅*} ∨ b). Then *x* ∈ SVars(*α*_{1}) ∩ SVars(*α*_{2}), but as *x*{*∅*} can never be part of the label of a parse tree, the regex formula *α* is functional.

In order to exclude these fringe cases and simplify the construction of EC^{reg}-formulas, we introduce the following concept: A regex formula *α* is *∅*-reduced if *α* = *∅*, or if *α* does not contain any occurrence of *∅*. Using simple rewrite rules, we can observe the following.

*Claim 1*

There is an algorithm that, given a regex formula *α*, computes in polynomial time an *∅*-reduced regex formula *α*_{R} with [[*α*_{R}]] = [[*α*]].

### Proof

To compute *α*_{R}, it suffices to rewrite *α* according to the following rewrite rules:

1. *∅*^{∗} → *ε*,
2. \((\hat {\alpha }\mathbin {\vee }\emptyset )\to \hat {\alpha }\) and \((\emptyset \mathbin {\vee }\hat {\alpha })\to \hat {\alpha }\) for all regex formulas \(\hat {\alpha }\),
3. \((\hat {\alpha }\cdot \emptyset )\to \emptyset \) and \((\emptyset \cdot \hat {\alpha })\to \emptyset \) for all regex formulas \(\hat {\alpha }\),
4. *x*{*∅*} → *∅* for all variables *x*.

As *∅* is never part of a parse tree, we can observe that for all regex formulas *α* and *β*, where *β* is obtained from *α* by applying any number of these rewrite rules, [[*β*]] = [[*α*]] holds. Furthermore, one can use these rules to convert *α* into an equivalent and *∅*-reduced *α*_{R} in polynomial time: If *α* is stored in a tree structure, it suffices to apply all applicable rules in a bottom-up manner. \(\square \) (Claim 1)
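The bottom-up application of the four rewrite rules can be sketched as a single recursive pass over the syntax tree. In the following Python sketch, regex formulas are encoded as nested tuples; this encoding, the tag names (`"empty"`, `"eps"`, `"sym"`, `"or"`, `"cat"`, `"star"`, `"var"`), and the function name `reduce_empty` are assumptions made for illustration.

```python
EMPTY = ("empty",)  # the regex formula for the empty set
EPS = ("eps",)      # the regex formula for the empty word


def reduce_empty(alpha):
    """Return an empty-set-reduced formula equivalent to `alpha`.

    Subformulas are reduced first (bottom-up), then the rule for the
    current node is applied, so every node is visited once.
    """
    kind = alpha[0]
    if kind in ("empty", "eps", "sym"):
        return alpha
    if kind == "star":
        inner = reduce_empty(alpha[1])
        # rule 1: (empty set)* -> epsilon
        return EPS if inner == EMPTY else ("star", inner)
    if kind == "var":
        inner = reduce_empty(alpha[2])
        # rule 4: x{empty set} -> empty set
        return EMPTY if inner == EMPTY else ("var", alpha[1], inner)
    if kind in ("or", "cat"):
        left, right = reduce_empty(alpha[1]), reduce_empty(alpha[2])
        if kind == "or":
            # rule 2: drop an empty disjunct
            if left == EMPTY:
                return right
            if right == EMPTY:
                return left
            return ("or", left, right)
        # rule 3: a concatenation with the empty set is empty
        if left == EMPTY or right == EMPTY:
            return EMPTY
        return ("cat", left, right)
    raise ValueError(f"unknown node type: {kind!r}")


# The fringe case discussed above: x{a} . (x{empty set} v b)
alpha = ("cat", ("var", "x", ("sym", "a")),
         ("or", ("var", "x", EMPTY), ("sym", "b")))
print(reduce_empty(alpha))
```

For the fringe case *x*{a} ⋅ (*x*{*∅*} ∨ b), the pass removes the unreachable second binding of *x*, so the two factors of the reduced formula have disjoint variable sets.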

This allows us to proceed to the main part of the proof. Recall that our goal is a procedure that, given a \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with SVars(*ρ*) = {*x*_{1}, …, *x*_{n}}, constructs an EC^{reg}-formula \(\varphi _{\rho }(x_{w},{x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}})\) such that for all \(w, {w^{P}_{1}}, {w^{C}_{1}},\ldots , {w^{P}_{n}}, {w^{C}_{n}}\in \Sigma ^{*}\), we have that \(\varphi _{\rho }(w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathtt {True}\) holds if and only if there is some *μ* ∈ [[*ρ*]](*w*) with \({w^{P}_{k}} = w_{[1,i_{k}\rangle }\) and \({w^{C}_{k}} = w_{[i_{k},j_{k}\rangle }\) for each 1 ≤ *k* ≤ *n*, where [*i*_{k}, *j*_{k}〉 = *μ*(*x*_{k}).

Note that *μ* is always uniquely defined by *w*, the \({w^{P}_{k}}\), and the \({w^{C}_{k}}\). Based on this, we introduce some notation that simplifies our reasoning. Given *w* ∈ *Σ*^{∗} and *μ* ∈ [[*ρ*]](*w*), we define the (2*n* + 1)-tuple \(\mathbf {w}_{\mu }:= (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\) by \({w^{P}_{k}} := w_{[1,i_{k}\rangle }\) and \({w^{C}_{k}} := w_{[i_{k},j_{k}\rangle }\) as in the previous paragraph. For the other direction, we say that a (2*n* + 1)-tuple \(\mathbf {w}= (w, {w^{P}_{1}}, {w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\) over *Σ*^{∗} is *spanner compatible* if, for all 1 ≤ *k* ≤ *n*, the concatenated word \({w^{P}_{k}}\cdot {w^{C}_{k}}\) is a prefix of *w*. In this case, we define \(\mu _{\mathbf {w}}\) through \(\mu _{\mathbf {w}}(x_{k}) = [i_{k},j_{k}\rangle \) with \(i_{k}:= |{w^{P}_{k}}|+1\) and \(j_{k}:=|{w^{P}_{k}} {w^{C}_{k}}|+1\) for 1 ≤ *k* ≤ *n*. Note that these are one-to-one conversions if *w* is fixed: Every *μ* defines its unique spanner compatible **w**_{μ}, and every spanner compatible **w** defines its unique \(\mu _{\mathbf {w}}\). We can now rephrase Definition 3.11 using this terminology, and observe that *φ*_{ρ} realizes [[*ρ*]] if and only if the following two statements hold:

1. For all **w** ∈ (*Σ*^{∗})^{2n+1}, we have that *φ*_{ρ}(**w**) = True implies that **w** is spanner compatible and \(\mu _{\mathbf {w}}\) ∈ [[*ρ*]](*w*).
2. If *μ* ∈ [[*ρ*]](*w*), then *φ*_{ρ}(**w**_{μ}) = True.
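The two conversions between *w*-tuples and spanner compatible tuples of words can be written down directly. The following Python sketch assumes that a *w*-tuple is given as a list of 1-based spans (*i*, *j*) standing for [*i*, *j*〉, listed in the order *x*_{1}, …, *x*_{n}; the function names `encode`, `is_spanner_compatible`, and `decode` are chosen here for illustration.

```python
def encode(w, mu):
    """Map (w, mu) to the tuple (w, wP_1, wC_1, ..., wP_n, wC_n).

    A span [i, j> with 1 <= i <= j <= len(w) + 1 has the prefix
    w[0:i-1] and the content w[i-1:j-1].
    """
    result = [w]
    for i, j in mu:
        result += [w[:i - 1], w[i - 1:j - 1]]
    return tuple(result)


def is_spanner_compatible(t):
    """Every prefix-content pair must concatenate to a prefix of w."""
    w = t[0]
    return all(w.startswith(t[k] + t[k + 1]) for k in range(1, len(t), 2))


def decode(t):
    """Inverse of encode for spanner compatible tuples."""
    if not is_spanner_compatible(t):
        raise ValueError("tuple is not spanner compatible")
    return [
        (len(t[k]) + 1, len(t[k]) + len(t[k + 1]) + 1)
        for k in range(1, len(t), 2)
    ]


w = "abcab"
mu = [(1, 3), (4, 6)]          # x1 covers "ab", x2 covers the second "ab"
t = encode(w, mu)
print(t)
print(decode(t) == mu)
```

Here both variables have the same content but different prefixes, which is exactly the distinction between equal subwords and equal spans that the encoding has to preserve.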

We begin with the construction of EC^{reg}-formulas from regex formulas. (The following sub-proof is rather lengthy, as it contains the full induction for the correctness proof. The main part of the proof continues after the proof of Claim 2.)

*Claim 2*

There is an algorithm that, given a functional regex formula *ρ* ∈ RGX, constructs in polynomial time an EC^{reg}-formula *φ*_{ρ} that realizes [[*ρ*]].

### Proof

Due to Claim 1, we can assume that *ρ* is *∅*-reduced. We define *φ*_{ρ} recursively as follows:

1. If *ρ* does not contain any variables (i. e., *n* = 0), *ρ* is a proper regular expression. Using canonical transformation techniques, we can construct in polynomial time a nondeterministic finite automaton *A* with \(\mathcal {L}(A)=\mathcal {L}(\rho )\), and we define

   $$\varphi_{\rho}(x_{w}):= L_{A}(x_{w}).$$

   Then *φ*_{ρ} realizes [[*ρ*]], as *φ*_{ρ}(*w*) = True holds if and only if \(w\in \mathcal {L}(A)=\mathcal {L}(\rho )\), which holds if and only if *μ*_{w} ∈ [[*ρ*]](*w*).

2. If *ρ* contains variables, we assume that SVars(*ρ*) = {*x*_{1}, …, *x*_{n}} with *n* ≥ 1. By definition of regex formulas, no variable of *ρ* may occur inside of a Kleene star. Hence, we can distinguish three cases:

   (a) *ρ* = *ρ*_{1} ∨ *ρ*_{2}, where *ρ*_{1}, *ρ*_{2} are functional regex formulas with SVars(*ρ*_{1}) = SVars(*ρ*_{2}) = SVars(*ρ*). We define

   $$\varphi_{\rho}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) := \left(\varphi_{\rho_{1}}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) \vee \varphi_{\rho_{2}}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right)\right).$$

   The intuition behind this formula should be clear; we proceed directly to proving the correctness. Assume that \(\varphi _{\rho _{1}}\) and \(\varphi _{\rho _{2}}\) realize [[*ρ*_{1}]] and [[*ρ*_{2}]], respectively. We choose any *w* ∈ *Σ*^{∗}. To show the direction from logic to spanners, we extend *w* into a tuple **w**. By definition, *φ*_{ρ}(**w**) = True holds if and only if \(\varphi _{\rho _{i}}(\mathbf {w})=\mathtt {True}\) for an *i* ∈ {1, 2}. As \(\varphi _{\rho _{i}}\) realizes [[*ρ*_{i}]], the tuple **w** is spanner compatible, and \(\mu _{\mathbf {w}}\) ∈ [[*ρ*_{i}]](*w*) holds. For the other direction, we proceed analogously: If *μ* ∈ [[*ρ*_{i}]](*w*), then \(\varphi _{\rho _{i}}(\mathbf {w}_{\mu })=\mathtt {True}\); hence, *φ*_{ρ}(**w**_{μ}) = True. We conclude that *φ*_{ρ} realizes [[*ρ*]].

   (b) *ρ* = *ρ*_{1} ⋅ *ρ*_{2}, where *ρ*_{1}, *ρ*_{2} are functional regex formulas with SVars(*ρ*_{1}) ∪ SVars(*ρ*_{2}) = SVars(*ρ*) and SVars(*ρ*_{1}) ∩ SVars(*ρ*_{2}) = *∅*. Without loss of generality, we can assume

   $$\textsf{SVars}(\rho_{1}) = \{x_{1},\ldots,x_{m}\}, \qquad \textsf{SVars}(\rho_{2}) = \{x_{m+1},\ldots,x_{n}\}$$

   with 0 ≤ *m* ≤ *n*. We define

   $$\varphi_{\rho}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) := \exists y_{1}, y_{2}, z^{P}_{m+1},\ldots,z^{P}_{n}\colon \varphi_{I}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n},y_{1},y_{2},z^{P}_{m+1},\ldots,z^{P}_{n}\right),$$

   where

   $$\begin{aligned} \varphi_{I}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n},y_{1},y_{2},z^{P}_{m+1},\ldots,z^{P}_{n}\right) := \Big( &(x_{w} = y_{1}\cdot y_{2}) \wedge \varphi_{\rho_{1}}\left(y_{1},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{m},x^{C}_{m}\right)\\ &\wedge \varphi_{\rho_{2}}\left(y_{2},z^{P}_{m+1},x^{C}_{m+1},\ldots,z^{P}_{n},x^{C}_{n}\right) \wedge \bigwedge_{m+1\leq i\leq n}\left(x^{P}_{i} = y_{1}\cdot z^{P}_{i}\right)\Big). \end{aligned}$$

   The idea behind this formula is as follows: As *ρ* = *ρ*_{1} ⋅ *ρ*_{2}, whenever [[*ρ*]](*w*) ≠ *∅* holds, *w* can be decomposed into *w* = *w*_{1} ⋅ *w*_{2}, where *w*_{1} is parsed in *ρ*_{1}, and *w*_{2} in *ρ*_{2}. We store these words in the variables *y*_{1} and *y*_{2}, respectively. For all variables in SVars(*ρ*_{1}), the spans of the *μ* ∈ [[*ρ*_{1}]](*w*_{1}) are also spans in *w* (as *w*_{1} is a prefix of *w*). Hence, we can use the results from *ρ*_{1} unchanged. On the other hand, [[*ρ*_{2}]](*w*_{2}) determines spans in relation to *w*_{2}. Hence, each span [*i*, *j*〉 ∈ Spans(*w*_{2}) corresponds to the span [*i* + *c*, *j* + *c*〉 ∈ Spans(*w*), where *c* := |*w*_{1}|. The variables \({z^{P}_{i}}\) represent the start of the span with respect to *y*_{2}, and the conjunction of the equations \(({x^{P}_{i}} = y_{1} \cdot {z^{P}_{i}})\) converts these starts into spans with respect to *x*_{w}.

   The correctness proof is a little lengthy, but straightforward. Assume that \(\varphi _{\rho _{1}}\) and \(\varphi _{\rho _{2}}\) realize [[*ρ*_{1}]] and [[*ρ*_{2}]]. Assume that *φ*_{ρ}(**w**) = True for some tuple \(\mathbf {w}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). By definition of *φ*_{ρ}, the tuple **w** can be extended into \(\mathbf {w}^{\prime }=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}},u_{1},u_{2},v^{P}_{m+1},\ldots ,{v^{P}_{n}})\) with *φ*_{I}(**w**′) = True. By observing the structure of *φ*_{I}, we obtain:

      i. *w* = *u*_{1} ⋅ *u*_{2},
      ii. \({w^{P}_{i}} = u_{1} \cdot {v^{P}_{i}}\) for *m* + 1 ≤ *i* ≤ *n*,
      iii. \(\varphi _{\rho _{1}}(\mathbf {u}_{1})=\mathtt {True}\) and \(\varphi _{\rho _{2}}(\mathbf {u}_{2})=\mathtt {True}\), where

      $$\mathbf{u}_{1} := \left(u_{1},w^{P}_{1},w^{C}_{1},\ldots,w^{P}_{m},w^{C}_{m}\right), \qquad \mathbf{u}_{2} := \left(u_{2},v^{P}_{m+1},w^{C}_{m+1},\ldots,v^{P}_{n},w^{C}_{n}\right).$$

   From this and our initial assumption, we can conclude that **w** is spanner compatible, and that \(\mu _{\mathbf {u}_{1}}\in [\![ \rho _{1} ]\!](u_{1})\) and \(\mu _{\mathbf {u}_{2}}\in [\![ \rho _{2} ]\!](u_{2})\) must hold. Thus, there exist corresponding parse trees *T*_{1} and *T*_{2} with respective root labels (*u*_{1}, *ρ*_{1}) and (*u*_{2}, *ρ*_{2}). We combine these into a new parse tree *T* by adding a new root node (*w*, *ρ*_{1} ⋅ *ρ*_{2}) that has *T*_{1} as left and *T*_{2} as right child. As described in Definition 2.11, this tree *T* defines the *w*-tuple

   $$\mu^{T}(x_{k}) = \begin{cases} [i_{k},j_{k}\rangle & \text{if } 1\leq k\leq m \text{ and } \mu_{1}(x_{k})=[i_{k},j_{k}\rangle,\\ [i_{k}+|u_{1}|,\,j_{k}+|u_{1}|\rangle & \text{if } m+1\leq k\leq n \text{ and } \mu_{2}(x_{k})=[i_{k},j_{k}\rangle, \end{cases}$$

   where \(\mu _{1}:= \mu _{\mathbf {u}_{1}}\) and \(\mu _{2}:= \mu _{\mathbf {u}_{2}}\). In other words, for the variables *x*_{1} to *x*_{m}, the *w*-tuple *μ*^{T} simulates *μ*_{1} in *u*_{1}, the left part of *w*; and for the variables *x*_{m+1} to *x*_{n}, it simulates *μ*_{2} in *u*_{2}, the right part of *w*. Hence, all spans for the latter variables are shifted by |*u*_{1}|. Using the equalities \({w^{P}_{i}} = u_{1} \cdot {v^{P}_{i}}\) from above, we obtain \(\mu ^{T}=\mu _{\mathbf {w}}\), which concludes this direction of the correctness proof. The other direction proceeds analogously: Given *μ* ∈ [[*ρ*]](*w*), we can use the corresponding parse tree *T* to factorize *w* into *u*_{1} and *u*_{2}. We then shift the spans of the variables *x*_{m+1} to *x*_{n} by |*u*_{1}|, and use this to obtain **u**_{2} with \(\varphi _{\rho _{2}}(\mathbf {u}_{2})=\mathtt {True}\). No effort is necessary for **u**_{1}, and we can then combine **u**_{1} and **u**_{2} into a tuple **w** with *φ*_{ρ}(**w**) = True and **w** = **w**_{μ}. Thus, *φ*_{ρ} realizes [[*ρ*]].

   (c) \(\rho = x\{\hat {\rho }\}\) for some *x* ∈ {*x*_{1}, …, *x*_{n}}, and \(\hat {\rho }\) is a functional regex formula with \(\textsf{SVars}(\hat {\rho }) = \textsf{SVars}(\rho )\setminus \{x\}\). Without loss of generality, let *x* = *x*_{1}. We define

   $$\varphi_{\rho}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) := \left(\left(x^{P}_{1}=\varepsilon\right) \wedge \left(x^{C}_{1} = x_{w}\right) \wedge \varphi_{\hat{\rho}}\left(x_{w},x^{P}_{2},x^{C}_{2},\ldots,x^{P}_{n},x^{C}_{n}\right)\right).$$

   The formula uses the fact that in this case, for each *μ* ∈ [[*ρ*]](*w*), we have that *μ*(*x*_{1}) = [1, |*w*| + 1〉 must hold. This is encoded by \({x^{P}_{1}}=\varepsilon \) and \({x^{C}_{1}} = x_{w}\). For the correctness proof, assume that \(\varphi _{\hat {\rho }}\) realizes \([\![ \hat {\rho } ]\!]\). Going from logic to spanners, assume that \(\mathbf {w}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\) and *φ*_{ρ}(**w**) = True. Due to the structure of the formula, we know that \({w^{P}_{1}} =\varepsilon \), \({w^{C}_{1}}=w\), and \(\varphi _{\hat {\rho }}(\hat {\mathbf {w}})=\mathtt {True}\) for \(\hat {\mathbf {w}}=(w,{w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). As \(\varphi _{\hat {\rho }}\) realizes \([\![ \hat {\rho } ]\!]\), we know that \(\hat {\mathbf {w}}\) is spanner compatible, and \(\mu _{\hat {\mathbf {w}}}\in [\![ \hat {\rho } ]\!](w)\). Due to this and the definition of *ρ*, we observe *μ* ∈ [[*ρ*]](*w*) for the *w*-tuple

   $$\mu(x_{k}) := \begin{cases} [1,|w|+1\rangle & \text{if } k=1,\\ \mu_{\hat{\mathbf{w}}}(x_{k}) & \text{if } k>1. \end{cases}$$

   As \(\mu =\mu _{\mathbf {w}}\), we conclude this direction of the proof. For the other direction, let *μ* ∈ [[*ρ*]](*w*). By definition, *μ*(*x*_{1}) = [1, |*w*| + 1〉 and \(\hat {\mu }\in [\![ \hat {\rho } ]\!](w)\) for \(\hat {\mu }=\mu |_{\{x_{2},\ldots ,x_{n}\}}\). Due to our initial assumption, \(\varphi _{\hat {\rho }}(\mathbf {w}_{\hat {\mu }})=\mathtt {True}\) must hold. Note that \(\mathbf {w}_{\hat {\mu }}=(w,{w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\), and let \(\mathbf {w}:= (w,\varepsilon , w, {w^{P}_{2}},{w^{C}_{2}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). Then *φ*_{ρ}(**w**) = True; and as **w** = **w**_{μ}, this concludes this direction. Thus, *φ*_{ρ} realizes [[*ρ*]].

The size of *φ*_{ρ} is polynomial in the size of *ρ*. More importantly, the construction of *φ*_{ρ} follows the syntax of *ρ*, and does not require expensive additional computations. Hence, *φ*_{ρ} can be computed in polynomial time. \(\square \) (Claim 2)
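The only case of Claim 2 that changes positions is concatenation, where every prefix that belongs to a variable of *ρ*_{2} is extended by the word parsed in *ρ*_{1}. The following Python sketch mirrors the equations *x*_{w} = *y*_{1} ⋅ *y*_{2} and \({x^{P}_{i}} = y_{1} \cdot {z^{P}_{i}}\) on concrete tuples of words. The function name `combine_concat` and the flat tuple layout (word first, then alternating prefixes and contents) are assumptions made for illustration.

```python
def combine_concat(u1, u2):
    """Combine tuples for rho_1 (over y1) and rho_2 (over y2).

    u1 = (y1, wP_1, wC_1, ..., wP_m, wC_m)
    u2 = (y2, vP_{m+1}, wC_{m+1}, ..., vP_n, wC_n)
    The result is the tuple for rho_1 . rho_2 over w = y1 y2.
    """
    y1, y2 = u1[0], u2[0]
    shifted = []
    for k in range(1, len(u2), 2):
        # prefix with respect to w: x^P_i = y1 . z^P_i; contents are unchanged
        shifted += [y1 + u2[k], u2[k + 1]]
    # entries for the variables of rho_1 are copied unchanged
    return (y1 + y2, *u1[1:], *shifted)


# x1 = [2,3> in "ab" (prefix "a", content "b");
# x2 = [1,2> in "cd" (prefix "",  content "c")
print(combine_concat(("ab", "a", "b"), ("cd", "", "c")))
```

In the combined tuple, the prefix of *x*_{2} has length 2, so its span starts at position 3 of *w* = abcd, which is the shift by |*u*_{1}| described in case (b).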

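We now return to the main part of the proof. Let \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with SVars(*ρ*) = {*x*_{1}, …, *x*_{n}}, *n* ≥ 0. We distinguish the following cases: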

1. *ρ* is a regex formula. This case is covered in Claim 2.

2. \(\rho = \pi _{Y} \hat {\rho }\), with *Y* = SVars(*ρ*) and \({\textsf{SVars}\left (\hat {\rho }\right )}\supseteq {\textsf{SVars}\left (\rho \right )}\). Assume without loss of generality that \(\textsf{SVars}(\hat {\rho })=\{x_{1},\ldots ,x_{n+m}\}\) with *m* ≥ 0. We define

   $$\varphi_{\rho}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) := \exists x^{P}_{n+1},x^{C}_{n+1},\ldots,x^{P}_{n+m},x^{C}_{n+m}\colon \varphi_{\hat{\rho}}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n+m},x^{C}_{n+m}\right).$$

   Regarding the correctness, assume that \(\varphi _{\hat {\rho }}\) realizes \([\![ \hat {\rho } ]\!]\). Hence, if \(\hat {\mu }\in [\![ \hat {\rho } ]\!](w)\), we have \(\varphi _{\hat {\rho }}(\mathbf {w}_{\hat {\mu }})=\mathtt {True}\). This means that for \(\mu := \hat {\mu }|_{Y}\), we know that *φ*_{ρ}(**w**_{μ}) = True holds as well. Likewise, if *φ*_{ρ}(**w**) = True, there exists an extension \(\hat {\mathbf {w}}\) of **w** with \(\mu _{\hat {\mathbf {w}}}\in [\![ \hat {\rho } ]\!](w)\). As \(\hat {\mathbf {w}}\) is spanner compatible, so is **w**. Thus, we observe \(\mu _{\mathbf {w}}=\mu _{\hat {\mathbf {w}}}|_{Y}\) and \(\mu _{\mathbf {w}}\) ∈ [[*ρ*]](*w*). Hence, *φ*_{ρ} realizes [[*ρ*]].

3. \(\rho = \zeta ^=_{\mathbf {x}} \hat {\rho }\), with \(\mathbf {x}\in ({\textsf{SVars}\left (\hat {\rho }\right )})^{m}\), 2 ≤ *m* ≤ *n*, and \(\textsf{SVars}(\rho )=\textsf{SVars}(\hat {\rho })\). Assume without loss of generality that **x** = (*x*_{1}, …, *x*_{m}). We define

   $$\varphi_{\rho}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) := \left(\varphi_{\hat{\rho}}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) \wedge \bigwedge_{2\leq i\leq m}\left(x^{C}_{1} = x^{C}_{i}\right)\right).$$

   Recall that \(\zeta ^=_{x_{i},x_{j}}\) only checks whether \(w_{\mu (x_{i})}=w_{\mu (x_{j})}\) holds, not whether *μ*(*x*_{i}) = *μ*(*x*_{j}). This is equivalent to checking whether \({x^{C}_{i}}={x^{C}_{j}}\) holds.

   We only prove the correctness for *m* = 2; the other cases proceed analogously (or by reducing them to this binary case). Assume that \(\varphi _{\hat {\rho }}\) realizes \([\![ \hat {\rho } ]\!]\). Let *μ* ∈ [[*ρ*]](*w*). Then \(w_{\mu (x_{1})}=w_{\mu (x_{2})}\) and \(\mu \in [\![ \hat {\rho } ]\!](w)\) hold by definition. The latter implies \(\varphi _{\hat {\rho }}(\mathbf {w}_{\mu })=\mathtt {True}\). Together with the former and the structure of *φ*_{ρ}, we conclude *φ*_{ρ}(**w**_{μ}) = True.

   For the other direction, let *φ*_{ρ}(**w**) = True. By the structure of *φ*_{ρ}, we know that \(\varphi _{\hat {\rho }}(\mathbf {w})=\mathtt {True}\) and \({w^{C}_{1}}={w^{C}_{2}}\). As \(\varphi _{\hat {\rho }}\) realizes \([\![ \hat {\rho } ]\!]\), we have that **w** is spanner compatible, and \(\mu _{\mathbf {w}}\in [\![ \hat {\rho } ]\!](w)\). Due to \({w^{C}_{1}}={w^{C}_{2}}\), this implies \(\mu _{\mathbf {w}}\) ∈ [[*ρ*]](*w*) and concludes the proof that *φ*_{ρ} realizes [[*ρ*]].

4. *ρ* = (*ρ*_{1} ∪ *ρ*_{2}), with SVars(*ρ*_{1}) = SVars(*ρ*_{2}) = SVars(*ρ*). Let

   $$\varphi_{\rho}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) := \left(\varphi_{\rho_{1}}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) \vee \varphi_{\rho_{2}}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right)\right).$$

   In this case, the construction and the correctness proof are identical to case 2a (disjunction) in the proof of Claim 2.

5. *ρ* = (*ρ*_{1} ⋈ *ρ*_{2}) with SVars(*ρ*) = SVars(*ρ*_{1}) ∪ SVars(*ρ*_{2}). We assume without loss of generality that SVars(*ρ*_{1}) = {*x*_{1}, …, *x*_{l}} and SVars(*ρ*_{2}) = {*x*_{m}, …, *x*_{n}} with 0 ≤ *l* ≤ *n*, 1 ≤ *m* ≤ *n* + 1, and *m* ≤ *l* + 1. Note that this implies SVars(*ρ*_{1}) ∩ SVars(*ρ*_{2}) = {*x*_{m}, …, *x*_{l}}, and SVars(*ρ*_{1}) ∩ SVars(*ρ*_{2}) = *∅* if and only if *m* = *l* + 1. We define

   $$\varphi_{\rho}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{n},x^{C}_{n}\right) := \left(\varphi_{\rho_{1}}\left(x_{w},x^{P}_{1},x^{C}_{1},\ldots,x^{P}_{l},x^{C}_{l}\right) \wedge \varphi_{\rho_{2}}\left(x_{w},x^{P}_{m},x^{C}_{m},\ldots,x^{P}_{n},x^{C}_{n}\right)\right).$$

   The definition of ⋈ requires that *μ* ∈ [[*ρ*]](*w*) holds if and only if there are *μ*_{1} ∈ [[*ρ*_{1}]](*w*) and *μ*_{2} ∈ [[*ρ*_{2}]](*w*) with *μ*_{1}(*x*_{i}) = *μ*_{2}(*x*_{i}) for all *i* ∈ {*m*, …, *l*}. For each of these variables *x*_{i}, we have that \(\varphi _{\rho _{1}}\) and \(\varphi _{\rho _{2}}\) model the span with the same variables \({x^{P}_{i}}\) and \({x^{C}_{i}}\).

   To prove the correctness, assume that \(\varphi _{\rho _{1}}\) and \(\varphi _{\rho _{2}}\) realize [[*ρ*_{1}]] and [[*ρ*_{2}]], respectively. Let *μ* ∈ [[*ρ*]](*w*). Then there exist *μ*_{1} ∈ [[*ρ*_{1}]](*w*) and *μ*_{2} ∈ [[*ρ*_{2}]](*w*) with \(\mu _{1}=\mu |_{\{x_{1},\ldots ,x_{l}\}}\) and \(\mu _{2}=\mu |_{\{x_{m},\ldots ,x_{n}\}}\), which implies *μ*_{1}(*x*_{k}) = *μ*_{2}(*x*_{k}) for *m* ≤ *k* ≤ *l*. Now, in order to talk about the components of \( \mathbf {w}_{\mu _{1}}\) and \( \mathbf {w}_{\mu _{2}}\), we name the components of the tuples as \(\mathbf {w}_{\mu _{1}}=(w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{l}},{w^{C}_{l}})\) and \(\mathbf {w}_{\mu _{2}}=(w,{w^{P}_{m}},{w^{C}_{m}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). As *μ*_{1} and *μ*_{2} agree on their common variables, we can combine this to \(\mathbf {w}:= (w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})=\mathbf {w}_{\mu }\). As each \(\varphi _{\rho _{i}}\) realizes [[*ρ*_{i}]], we know that \(\varphi _{\rho _{i}}(\mathbf {w}_{\mu _{i}})=\mathtt {True}\). Hence, *φ*_{ρ}(**w**_{μ}) = *φ*_{ρ}(**w**) = True. This concludes this direction.

   For the other direction, assume that *φ*_{ρ}(**w**) = True. Due to the structure of the formula, this implies \(\varphi _{\rho _{i}}(\mathbf {w}_{i})=\mathtt {True}\), where \(\mathbf {w}_{1}:= (w,{w^{P}_{1}},{w^{C}_{1}},\ldots ,{w^{P}_{l}},{w^{C}_{l}})\) and \(\mathbf {w}_{2}:= (w,{w^{P}_{m}},{w^{C}_{m}},\ldots ,{w^{P}_{n}},{w^{C}_{n}})\). As \(\varphi _{\rho _{i}}\) realizes [[*ρ*_{i}]], we know that **w**_{i} is spanner compatible, and \(\mu _{\mathbf {w}_{i}}\in [\![ \rho _{i} ]\!](w)\). Due to the former, **w** is also spanner compatible. Due to the latter, we know that \(\mu _{\mathbf {w}}\) ∈ [[*ρ*]](*w*), as \(\mu _{\mathbf {w}}(x_{k})=\mu _{\mathbf {w}_{1}}(x_{k})=\mu _{\mathbf {w}_{2}}(x_{k})\) for all *m* ≤ *k* ≤ *l*. Hence, *φ*_{ρ} realizes [[*ρ*]].

*φ*_{ρ} can be derived from *ρ* without requiring further computation, and its size is polynomial in the size of *ρ*. Hence, *φ*_{ρ} can be constructed in polynomial time.
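The join case above relies on the semantics of ⋈: two tuples are combined exactly when they agree on the shared variables. The following is a minimal sketch of that semantics only, not of the formula construction. It assumes a hypothetical representation in which a tuple is a dictionary from variable names to spans, and a span is a pair of positions; the function name `join` is chosen for illustration.

```python
def join(rel1, rel2):
    """Natural join of two span relations.

    Each relation is a list of tuples; a tuple is a dict that maps
    variable names to spans (here: pairs of positions). This encoding
    is an assumption made for the sketch.
    """
    result = []
    for mu1 in rel1:
        for mu2 in rel2:
            shared = mu1.keys() & mu2.keys()
            # Combine only tuples that agree on all common variables.
            if all(mu1[x] == mu2[x] for x in shared):
                result.append({**mu1, **mu2})
    return result


rel1 = [{"x1": (1, 2), "x2": (2, 4)}, {"x1": (1, 3), "x2": (3, 4)}]
rel2 = [{"x2": (2, 4), "x3": (4, 5)}]
joined = join(rel1, rel2)
```

If the two relations share no variables, the condition is vacuously true and the sketch degenerates to a Cartesian product, which mirrors the case *m* = *l* + 1.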

As we shall see in Section 4.2, this result allows us to find upper bounds on two problems from the static analysis of spanners. We now examine how spanners can simulate word equations (and, thereby, also EC^{reg}-formulas). As discussed above, spanners need to relate their variables to an input word. Hence, we only state the following result, which is a weaker form of simulation than for the other direction:

**Theorem 3.13**

*Every word equation η* := (*η*_{L}, *η*_{R}) *with regular constraints* \({\mathcal {C}}\) *can be converted effectively into a* *ρ* ∈ \(\textsf{RGX}^{\{\zeta ^=,\times \}}\) *with* SVars(*ρ*) ⊇ Vars(*η*) *such that for all* *w* ∈ *Σ*^{∗}, *there is a solution σ of η under constraints* \(\mathcal {C}\) *with* *w* = *σ*(*η*_{L}) = *σ*(*η*_{R}) *if and only if there is a* *μ* ∈ [[*ρ*]](*w*) *with* *σ*(*x*) = *w*_{μ(x)} *for all* *x* ∈ Vars(*η*).

### Proof

As each of the two sides of a word equation is a pattern, we can transform those into regex formulas by using a slightly adapted version of the conversion procedure from the proof of Theorem 3.3. Only two changes are made. Firstly, instead of binding a variable *x* to some *Σ*^{∗}, we respect the constraints by using a regular expression for the language \(\mathcal {L}({\mathcal {C}}(x))\). Secondly, in order to ensure SVars(*ρ*) ⊇ Vars(*η*), the first occurrence of a variable *x* is not represented by *x*_{1}, but by *x*.

Let *η*_{L} = *α*_{1}⋯*α*_{m} and *η*_{R} = *α*_{m+1}⋯*α*_{n} with \(m,n\in \mathbb {N}\), *m* + 1 ≤ *n*, and *α*_{1}, …, *α*_{n} ∈ (*Σ* ∪ *X*). We construct regex formulas \(\hat {\eta }_{L}:= \hat {\alpha }_{1}{\cdots } \hat {\alpha }_{m}\) and \(\hat {\eta }_{R}:= \hat {\alpha }_{m+1}{\cdots } \hat {\alpha }_{n}\), where for each position 1 ≤ *i* ≤ *n*, we define \(\hat {\alpha }_{i}\) as follows:

1. If *α*_{i} is a terminal (i. e., there is an *a* ∈ *Σ* with *α*_{i} = *a*), let \(\hat {\alpha }_{i}:= a\).
2. If *α*_{i} is a variable (i. e., there is an *x* ∈ *X* with *α*_{i} = *x*), let *γ* be a regular expression with \(\mathcal {L}(\gamma )=\mathcal {L}(\mathcal {C}(x))\). Furthermore, let *j* := |*α*_{1}⋯*α*_{i}|_{x}.
   - (a) If *j* = 1, define \(\hat {\alpha }_{i}:= x\{\gamma \}\).
   - (b) If *j* ≥ 2, define \(\hat {\alpha }_{i}:= x_{j}\{\gamma \}\) (where *x*_{j} ∈ SVars is a new variable).

We define a sequence *S* of string equality selections appropriately: For every *x* ∈ Vars(*η*) with *k* := |*η*_{L}*η*_{R}|_{x} ≥ 2, the sequence *S* includes a selection \(\zeta ^=_{x,x_{2},\ldots ,x_{k}}\).

Finally, we define \(\rho := S(\hat {\eta }_{L}\times \hat {\eta }_{R})\).
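The construction of \(\hat {\eta }_{L}\), \(\hat {\eta }_{R}\), and *S* is purely syntactic, so it can be sketched in a few lines. The following illustration makes several assumptions that are not part of the paper: sides of the equation are lists of symbols, the constraint of each variable is already given as a regular expression string, regex formulas are rendered as plain strings, and the function name `equation_to_spanner` is hypothetical.

```python
from collections import Counter


def equation_to_spanner(eta_l, eta_r, variables, constraints):
    """Return (hat_eta_L, hat_eta_R, selections) for a word equation.

    eta_l, eta_r: lists of symbols; variables: set of variable names;
    constraints: dict mapping each variable to a regular expression string.
    """
    seen = Counter()  # occurrences of each variable in eta_L * eta_R so far

    def convert(side):
        parts = []
        for sym in side:
            if sym not in variables:
                parts.append(sym)  # terminal: copied unchanged
                continue
            seen[sym] += 1
            j = seen[sym]
            # First occurrence keeps the name x, the j-th one becomes x_j.
            name = sym if j == 1 else f"{sym}_{j}"
            parts.append(f"{name}{{{constraints[sym]}}}")
        return "·".join(parts)

    hat_l = convert(eta_l)  # the left side is processed first
    hat_r = convert(eta_r)
    # One string equality selection per variable with at least two occurrences.
    selections = [
        [x] + [f"{x}_{j}" for j in range(2, k + 1)]
        for x, k in sorted(seen.items())
        if k >= 2
    ]
    return hat_l, hat_r, selections


hat_l, hat_r, sels = equation_to_spanner(
    ["x", "y"], ["y", "x"], {"x", "y"}, {"x": "aab+", "y": "Σ+"}
)
```

The spanner representation is then the selections in `sels` applied to the product of the two returned formulas. The conversion of NFA-based constraints into regular expressions, which is the only potentially expensive step, is deliberately left outside this sketch.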

We show that, for all *w* ∈ *Σ*^{∗}, *μ* ∈ [[*ρ*]](*w*) holds if and only if there is a solution *σ* of *η* under constraints \({\mathcal {C}}\) with

1. *w* = *σ*(*η*_{L}) = *σ*(*η*_{R}), and
2. *σ*(*x*) = *w*_{μ(x)} for all *x* ∈ Vars(*η*).

We begin with the *if*-direction. Assume that *σ* is a solution of *η* under constraints \({\mathcal {C}}\). Let *w* := *σ*(*η*_{L}) (which implies *w* = *σ*(*η*_{R}), as *σ* is a solution of *η*). We use this to define a *w*-tuple *μ* as follows: Due to our construction, each variable \(\hat {x}\in {\textsf{SVars}\left (\rho \right )}\) corresponds to a uniquely defined *α*_{i} with *α*_{i} = *x*. If 1 ≤ *i* ≤ *m*, then \(\hat {x}\) occurs in \(\hat {\eta _{L}}\), and if *m* + 1 ≤ *i* ≤ *n*, then \(\hat {x}\) occurs in \(\hat {\eta _{R}}\). We now define \(\mu (\hat {x}):= [l,r\rangle \), where the choice of *l* and *r* depends on this distinction:

- If \(\hat {x}\) occurs in \(\hat {\eta _{L}}\), let *l* := |*σ*(*α*_{1}⋯*α*_{i−1})| + 1 and *r* := |*σ*(*α*_{1}⋯*α*_{i})| + 1.
- If \(\hat {x}\) occurs in \(\hat {\eta _{R}}\), let *l* := |*σ*(*α*_{m+1}⋯*α*_{i−1})| + 1 and *r* := |*σ*(*α*_{m+1}⋯*α*_{i})| + 1.

Either way, we know that \(w_{\mu (\hat {x})}=\sigma (x)\) holds, which implies \(w_{\mu (\hat {x})}\in \mathcal {L}({\mathcal {C}}(x))\). Analogously, we can use *σ* to construct parse trees for \((w,\hat {\eta }_{L})\) and \((w,\hat {\eta }_{R})\). This allows us to conclude \(\mu \in [{\kern -2.3pt}[ \hat {\eta }_{L}\times \hat {\eta }_{R} ]{\kern -2.3pt}](w)\). Furthermore, for every selection \(\zeta ^=_{x,x_{2},\ldots ,x_{k}}\) in *S*, we know from the construction that *x* and all *x*_{i} (2 ≤ *i* ≤ *k*) refer to the same *x* ∈ Vars(*η*), which means that \(w_{\mu (x)}=w_{\mu (x_{i})}=\sigma (x)\) holds. Hence, for each of these selections, \(\mu \in [{\kern -2.3pt}[ \hat {\eta }_{L}\times \hat {\eta }_{R} ]{\kern -2.3pt}](w)\) implies \(\mu \in [{\kern -2.3pt}[ \zeta ^=_{x,x_{2},\ldots ,x_{k}}(\hat {\eta }_{L}\times \hat {\eta }_{R}) ]{\kern -2.3pt}](w)\). Thus, \(\mu \in [{\kern -2.3pt}[ S(\hat {\eta }_{L}\times \hat {\eta }_{R}) ]{\kern -2.3pt}](w)\), which is equivalent to *μ* ∈ [[*ρ*]](*w*) and concludes this direction of the proof.
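The choice of *l* and *r* in this direction of the proof amounts to tracking prefix lengths under *σ*. As a small sketch of that bookkeeping (with the assumptions that a side of the equation is a list of symbols, that *σ* is a dictionary, and that the helper names `spans_for_side` and `substring` are hypothetical), the spans can be computed as follows.

```python
def spans_for_side(side, sigma, variables):
    """Return [(variable, (l, r)), ...] for the variable positions of one side.

    Spans are 1-based and half-open, i.e. [l, r> covers positions l..r-1.
    """
    spans = []
    prefix_len = 0  # length of sigma applied to the symbols read so far
    for sym in side:
        image = sigma[sym] if sym in variables else sym
        l = prefix_len + 1
        prefix_len += len(image)
        r = prefix_len + 1
        if sym in variables:
            spans.append((sym, (l, r)))
    return spans


def substring(w, span):
    """Return the factor of w that the span [l, r> selects."""
    l, r = span
    return w[l - 1 : r - 1]


# A hypothetical solution of the equation (xy, yx).
sigma = {"x": "ab", "y": "abab"}
w = "ababab"
left = spans_for_side(["x", "y"], sigma, {"x", "y"})
right = spans_for_side(["y", "x"], sigma, {"x", "y"})
```

For every computed span, `substring(w, span)` returns the image of the corresponding variable under `sigma`, which is the property the proof uses to satisfy the string equality selections.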

For the *only if*-direction, assume that *μ* ∈ [[*ρ*]](*w*). We now define a pattern substitution *σ* by *σ*(*a*) := *a* for all *a* ∈ *Σ*, and *σ*(*x*) := *w*_{μ(x)} for all *x* ∈ Vars(*η*). By our construction, *μ*(*x*) is derived from *x*{*γ*}, where \(\mathcal {L}(\gamma )=\mathcal {L}({\mathcal {C}}(x))\) must hold, which means that \(w_{\mu (x)}\in \mathcal {L}({\mathcal {C}}(x))\), and hence \(\sigma (x)\in \mathcal {L}({\mathcal {C}}(x))\). All that remains to be shown is that *σ*(*η*_{L}) = *σ*(*η*_{R}) = *w*. In order to prove this, we first define \(\hat {w}_{L} = \hat {w}_{1}{\cdots } \hat {w}_{m}\) and \(\hat {w}_{R} = \hat {w}_{m+1}{\cdots } \hat {w}_{n}\), where the \(\hat {w}_{i}\) with 1 ≤ *i* ≤ *n* are defined as follows:

1. If *α*_{i} = *a* ∈ *Σ*, let \(\hat {w}_{i}:= a\). Then \(\hat {w}_{i}=\hat {\alpha }_{i}\) and \(\hat {w}_{i}=\sigma (\alpha _{i})\) hold by definition.
2. If *α*_{i} = *x* ∈ *X*, let *j* := |*α*_{1}⋯*α*_{i}|_{x}. We distinguish two cases.
   - (a) If *j* = 1, let \(\hat {w}_{i} = w_{\mu (x)}\). Then \(\sigma (\alpha _{i})=\hat {w}_{i}\) holds by definition.
   - (b) If *j* ≥ 2, let \(\hat {w}_{i} = w_{\mu (x_{j})}\). Observe that *S* contains the selection \(\zeta ^=_{x,x_{2},\ldots ,x_{k}}\). Hence, \(w_{\mu (x_{j})}=w_{\mu (x)}\) holds, which implies \(\sigma (\alpha _{i})=\hat {w}_{i}\).

*i* ≤ *m*. This allows us to conclude *σ*(*η*_{L}) = *σ*(*η*_{R}) = *w*, which concludes this direction of the proof. □

While this form of simulation is weaker (as *w* has to be present), it still shows that the constructed spanner is satisfiable if and only if the word equation (with constraints) is satisfiable. Furthermore, the computed (*V*, *w*)-relations encode solutions of the equation.

*Example 3.14*

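Let a, b ∈ *Σ* and define *η* := (*x**y*, *y**x*) with \(\mathcal {L}({\mathcal {C}}(x))=\mathcal {L}(\mathtt {aab^{+}})\) and \(\mathcal {L}({\mathcal {C}}(y))=\Sigma ^{+}\). The construction from the proof of Theorem 3.13 results in

$$\rho = \zeta^=_{x,x_{2}}\zeta^=_{y,y_{2}}\left( x\{\mathtt{aab^{+}}\}\cdot y\{\Sigma^{+}\} \times y_{2}\{\Sigma^{+}\}\cdot x_{2}\{\mathtt{aab^{+}}\}\right). $$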

The only reason that this construction is not necessarily possible in polynomial time is that regular constraints are specified with NFAs, while core spanners use regular expressions, which can lead to an exponential increase in the size.

*z*_{1}, *z*_{2}, we can construct *ρ*; the only difference is that the solution is encoded in *w* = *σ*(*η*_{L} ⋅ *η*_{R}), instead of *σ*(*η*_{L}).

### 3.3 Xregexes

As shown by Fagin et al. [7], there are languages that are recognized by xregexes, but not by core spanners. In order to prove this, [7] introduced the so-called “uniform-0-chunk”-language *L*_{uzc}: Assuming 0, 1 ∈ *Σ*, *L*_{uzc} is defined as the language of all *w* = *s*_{1} ⋅ *t* ⋅ *s*_{2} ⋅ *t*⋯*s*_{n−1} ⋅ *t* ⋅ *s*_{n}, where *n* > 0, *s*_{1}, …, *s*_{n} ∈ {1}^{+}, and *t* ∈ {0}^{+}. Then \(\mathcal {L}(\alpha _{\text {uzc}})=L_{\text {uzc}}\) holds for the xregex *α*_{uzc} := 1^{+} ⋅ *x*{0^{∗}} ⋅ (1^{+} ⋅& *x*)^{∗} ⋅ 1^{+}, but no core spanner recognizes *L*_{uzc}.

Considering that the syntax of regex formulas does not allow the use of variables inside a Kleene star (or plus), this inexpressibility result might be considered expected, as *α*_{uzc} has an occurrence of &*x* inside a Kleene star. This raises the question whether xregexes that restrict variables in a similar manner can still recognize languages that core spanners cannot. In order to examine this question, we define the following subclass of xregexes:

**Definition 3.15**

An xregex *α* is *variable star-free* (short: *vstar-free*) if, for every *β* ∈ Sub(*α*) with *β* = *γ*^{∗}, no subexpression of *γ* is a variable binding or a variable reference. We denote the class of all vstar-free xregexes by vsfXR.

As we shall see in Theorem 3.21 below, every language that is recognized by a vstar-free xregex is also recognized by a core spanner. While this observation might be considered not very surprising, its proof needs to deal with some technicalities. In particular, one needs to deal with expressions like *α* := *x*{*Σ*^{∗}}⋅ (&*x* ∨& *x*&*x*). A conversion in the spirit of Theorem 3.3 would need to replace the &*x* with distinct variables and ensure equality with selections; but as the disjunction contains subexpressions with distinct numbers of occurrences of &*x*, we would not be able to ensure functionality of the resulting regex formula. We avoid these problems by working with the following syntactically restricted class of vstar-free xregexes:

**Definition 3.16**

An *α* ∈ vsfXR is an *xregex path* if, for every *β* ∈ Sub(*α*) with *β* = (*γ*_{1} ∨ *γ*_{2}), no subexpression of *γ*_{1} or *γ*_{2} is a variable binding or a variable reference. We denote the class of all xregex paths by XRP.

Intuitively, an xregex path *α* ∈ XRP can be understood as a concatenation *α* = *α*_{1}⋯*α*_{n}, where each *α*_{i} is either a proper regular expression, a variable reference, or a variable binding of the form \(\alpha _{i} = x\{\hat {\alpha }\}\), where \(\hat {\alpha }\) is also an xregex path. By “multiplying out” disjunctions that contain variables, we can convert every vstar-free xregex into a disjunction of xregex paths.

**Lemma 3.17**

*There is an algorithm that, given**α* ∈ vsfXR, *computes**α*_{1}, …, *α*_{n} ∈ XRP*with*\(\mathcal {L}(\alpha )=\bigcup _{i=1}^{n}\mathcal {L}(\alpha _{i})\).

### Proof

If *α* is not an xregex path, there exists at least one *x* ∈ Vars(*α*) and at least one subexpression *β* ∈ Sub(*α*) with *β* ≠ *α* such that

1. *β* is a disjunction; i. e., *β* = (*γ*_{1} ∨ *γ*_{2}) for some *γ*_{1}, *γ*_{2} ∈ vsfXR,
2. *β* contains a variable binding *x*{⋯} or a variable reference &*x*.

We rewrite *α* into two vstar-free xregexes *α*_{1} and *α*_{2}, by replacing *β* with *γ*_{1} or *γ*_{2}, respectively. We observe that this rewriting step does not change the language:

*Claim 1*

\(\mathcal {L}(\alpha )=\mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\)

### Proof

If \(w\in \mathcal {L}(\alpha )\), there exists an *α*-parse tree *T* for *w*; in other words, the root of *T* is labelled with (*w*, *α*). Recall that *α* is vstar-free. Hence, we know that *T* uses the occurrence of *β* that was rewritten to create *α*_{1} and *α*_{2} at most once (in order to be able to use the occurrence multiple times, *α* would need to contain a star around *β*).

This allows us to distinguish two possibilities: If *T* does not use this occurrence of *β* at all, we can immediately transform *T* into an *α*_{i}-parse tree *T*_{i} (*i* ∈ {1, 2}) by replacing the root label with (*w*, *α*_{i}), and changing all children accordingly. Hence, \(w\in \mathcal {L}(\alpha _{i})\) holds.

On the other hand, if *T* uses this occurrence of *β*, then there exists a uniquely defined node *v* in *T* that is labeled with \((\hat {w},\beta )\) for some word \(\hat {w}\in \Sigma ^{*}\). Furthermore, this node corresponds to the occurrence of *β* that was rewritten in *α*_{1} and *α*_{2}. By definition, *v* has exactly one child \(\hat {v}\) that is labeled with \((\hat {w},\gamma _{i})\), where *i* ∈ {1, 2}. We rewrite *T* into an *α*_{i}-parse tree *T*_{i} by removing *v* (i. e., \(\hat {v}\) replaces *v*), relabeling the root of *T* to (*w*, *α*_{i}), and changing all labels between the root and \(\hat {v}\) accordingly. As *T*_{i} is an *α*_{i}-parse tree for *w*, we have that \(w\in \mathcal {L}(\alpha _{i})\) holds. This proves \(\mathcal {L}(\alpha )\subseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\).

In order to prove \(\mathcal {L}(\alpha )\supseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\), we proceed analogously: If \(w\in \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\), we can transform an *α*_{i}-parse tree for *w* into an *α*-parse tree by inserting a node \((\hat {w},\beta )\) (if necessary), and changing the labels accordingly. \(\square \) (Claim 1)

Note that this equivalence relies on the fact that *α* is vstar-free, which implies that *β* does not occur inside a Kleene star. For xregexes that are not vstar-free, we can only conclude \(\mathcal {L}(\alpha )\supseteq \mathcal {L}(\alpha _{1})\cup \mathcal {L}(\alpha _{2})\). This is easily seen considering the example of *x*{a}*y*{b}(&*x* ∨ &*y*)^{∗}, which would be rewritten to *x*{a}(&*x*)^{∗} and *y*{b}(&*y*)^{∗}.

We repeat this rewriting procedure on every created vstar-free xregex that is not an xregex path. This procedure terminates, as every rewriting removes a disjunction that contains at least one variable (binding or reference). Hence, if *α* contains \(k\in \mathbb {N}_{>0}\) disjunctions, this process results in xregex paths *α*_{1}, …, *α*_{n} for some *n* ≤ 2^{k}, and \(\mathcal {L}(\alpha )=\bigcup _{i=1}^{n}\mathcal {L}(\alpha _{i})\). □
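The rewriting procedure can be illustrated on a small syntax tree. The sketch below is not the step-by-step procedure of the proof: it multiplies out all variable-containing disjunctions in one recursive pass, which yields the same set of xregex paths. The tuple-based tree encoding (`"re"`, `"ref"`, `"bind"`, `"or"`, `"cat"`) and the function names are assumptions made for illustration; Kleene stars are assumed to occur only inside opaque `"re"` leaves, as vstar-freeness guarantees.

```python
from itertools import product


def has_var(node):
    """True if the subtree contains a variable binding or reference."""
    kind = node[0]
    if kind in ("bind", "ref"):
        return True
    if kind == "cat":
        return any(has_var(child) for child in node[1])
    if kind == "or":
        return has_var(node[1]) or has_var(node[2])
    return False  # ("re", text) is a variable-free regular expression


def paths(node):
    """Return the list of xregex paths whose union is equivalent to node."""
    if not has_var(node):
        return [node]  # variable-free parts (including disjunctions) stay intact
    kind = node[0]
    if kind == "ref":
        return [node]
    if kind == "bind":
        return [("bind", node[1], p) for p in paths(node[2])]
    if kind == "or":
        return paths(node[1]) + paths(node[2])  # split the disjunction
    # Concatenation: combine every choice of paths for the children.
    return [("cat", list(combo)) for combo in product(*(paths(c) for c in node[1]))]


def show(node):
    """Render a tree in the notation of the text."""
    kind = node[0]
    if kind == "re":
        return node[1]
    if kind == "ref":
        return "&" + node[1]
    if kind == "bind":
        return node[1] + "{" + show(node[2]) + "}"
    if kind == "or":
        return "(" + show(node[1]) + "∨" + show(node[2]) + ")"
    return "·".join(show(child) for child in node[1])


any_word = ("re", "Σ*")
alpha = ("cat", [
    ("bind", "x", any_word),
    ("ref", "x"),
    ("or", ("bind", "x", any_word), ("bind", "y", any_word)),
    ("or", ("ref", "x"), ("ref", "y")),
    ("ref", "x"),
])
rendered = [show(p) for p in paths(alpha)]
```

The expression `alpha` is the one used in Example 3.18; with two variable-containing disjunctions, the sketch produces four paths, in line with the bound *n* ≤ 2^{k}.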

*Example 3.18*

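Let *α* := *x*{*Σ*^{∗}} ⋅ &*x* ⋅ (*x*{*Σ*^{∗}} ∨ *y*{*Σ*^{∗}}) ⋅ (&*x* ∨ &*y*) ⋅ &*x*. Multiplying out the disjunctions, we obtain the following xregex paths:

$$\begin{array}{@{}rcl@{}} \alpha_{1} &=& x\{\Sigma^{*}\}\cdot \&x\cdot x\{\Sigma^{*}\}\cdot \&x\cdot \&x,\\ \alpha_{2} &=& x\{\Sigma^{*}\}\cdot \&x\cdot x\{\Sigma^{*}\}\cdot \&y\cdot \&x,\\ \alpha_{3} &=& x\{\Sigma^{*}\}\cdot \&x\cdot y\{\Sigma^{*}\}\cdot \&x\cdot \&x,\\ \alpha_{4} &=& x\{\Sigma^{*}\}\cdot \&x\cdot y\{\Sigma^{*}\}\cdot \&y\cdot \&x. \end{array} $$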

This transformation process might result in an exponential number of xregex paths; but as efficiency is not of concern right now, this is not a problem (the followup paper Freydenberger [13] shows that this blowup can be avoided with a more involved construction). Each of these xregex paths is then transformed into a functional regex formula:

**Lemma 3.19**

*There is an algorithm that, given**α* ∈ XRP, *computes*\(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=\}}\)*with*\(\mathcal {L}(\rho )=\mathcal {L}(\alpha )\).

### Proof

Before we start with the proof, note that we can safely assume that *α* does not contain *∅*: If *∅* occurs inside a Kleene star (or a disjunction), that Kleene star (or disjunction) cannot contain any variable bindings or references, as *α* is an xregex path. Hence, we can remove *∅* as in the proof of Theorem 3.12. All other occurrences of *∅* imply \(\mathcal {L}(\alpha )=\emptyset \) – in this case, we are done.

Our goal is to rewrite the xregex path *α* into an equivalent core spanner of the form *π*_{∅}*S**δ*, where *δ* is a regex formula, and *S* is a sequence of string equality selections.

The main idea of the construction is quite straightforward: We basically replace each variable reference &*x* with a unique *x*_{i}{*Σ*^{∗}}, and use a string equality \(\zeta ^=_{x,x_{i}}\) to connect *x*_{i} with the appropriate binding. The only technical problem is that unlike regex formulas, xregexes allow variables to be bound multiple times. We solve this by using a unique variable for every occurrence of a variable binding in *α*.

As explained above, the xregex path *α* can be understood as a concatenation *α* = *α*_{1}⋯*α*_{n}, where each *α*_{i} is either a proper regular expression, a variable reference, or a variable binding of the form \(\alpha _{i} = x\{\hat {\alpha }\}\), where \(\hat {\alpha }\) is also an xregex path.

For every occurrence of a variable reference &*x* in *α*, exactly one of the following two cases applies:

1. There is no binding *x*{} in *α* that is to the left of that occurrence of &*x*, or
2. there is a binding *x*{} in *α* that is to the left of that occurrence of &*x*.

In the first case, this &*x* will always default to *ε*, which means that we can safely replace it with *ε*.

In the second case, we see that this &*x* will always refer to the variable binding *x*{} that is closest to it to the left in *α*. In other words, we can simply read *α* from left to right. All &*x* before the first binding for *x* default to *ε*; and all &*x* after the first binding for *x* refer to the most recent binding for *x* (recall that, according to our definition of xregexes, no variable binding for a variable *x* may contain another binding of *x*).

We rewrite *α* into an xregex path *γ* with \(\mathcal {L}(\gamma )=\mathcal {L}(\alpha )\) such that no occurrence of a variable reference &*x* in *γ* refers to the default value *ε*, and every variable binding *x*{⋯} occurs at most once. This is done in the following way: We read *α* from left to right. If we encounter a reference &*x* for which no binding has been seen, we replace it with *ε*. If we encounter a binding *x*{} that has already been seen before, we replace it with a binding for a new variable \(\hat {x}\), and all occurrences of &*x* are renamed to \(\&\hat {x}\). (Of course, further occurrences of *x*{} would require further new variables.) For example, the xregex path

After rewriting *α* to *γ*, the next step is to transform *γ* into a regex formula *δ* by replacing all variable references in a manner that is similar to the proof of Theorem 3.3. More specifically, we construct *δ* by replacing, for each *x* ∈ Vars(*γ*), the *i*-th occurrence of &*x* in *γ* with *x*_{i}{*Σ*^{∗}}. Note that *δ* is functional: Each variable in SVars(*δ*) appears exactly once in *δ*; and as *δ* is also an xregex path, this implies that every *δ*-parse tree contains every variable exactly once. (Recall that we assumed that *α* does not contain *∅*; hence, neither do *γ* and *δ*.)

For every variable *x* for which there occur references &*x* in *γ*, we define a selection \(\zeta ^=_{V_{x}}\), where *V*_{x} := {*x*} ∪ {*x*_{i} ∣ *x*_{i} occurs in *δ*}. We let *S* denote a sequence of these selections (the order is irrelevant), and define the spanner representation *ρ* := *π*_{∅}*S**δ*. As we simulate the behavior of each variable binding *x*{⋯} and its references &*x* using the selection \(\zeta ^=_{V_{x}}\), it is easy to see that \(\mathcal {L}(\rho )=\mathcal {L}(\gamma )\) and, hence, \(\mathcal {L}(\rho )=\mathcal {L}(\alpha )\). □
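The left-to-right pass of this proof can be sketched compactly. The following illustration fuses the two steps (rewriting *α* to *γ*, then *γ* to *δ*) into one traversal. It assumes a hypothetical encoding in which an xregex path is a list of tokens `("re", text)`, `("ref", x)`, or `("bind", x, body)`; the naming scheme for fresh variables (`x^2` for a second binding of `x`, `x_1` for the first replaced reference) is chosen only for readability.

```python
from collections import Counter


def path_to_core_spanner(path):
    """Return (delta, selections) for an xregex path given as a token list."""
    current = {}          # original variable -> name of its most recent binding
    bindings = Counter()  # bindings seen so far, per original variable
    refs = Counter()      # references replaced so far, per binding name
    selections = {}       # binding name -> variables that must hold equal words

    def convert(tokens):
        out = []
        for tok in tokens:
            if tok[0] == "re":
                out.append(tok[1])
            elif tok[0] == "ref":
                x = tok[1]
                if x not in current:
                    out.append("ε")  # no binding to the left: default value
                    continue
                name = current[x]
                refs[name] += 1
                copy = f"{name}_{refs[name]}"
                selections.setdefault(name, [name]).append(copy)
                out.append(copy + "{Σ*}")  # the reference becomes a fresh binding
            else:  # ("bind", x, body)
                x, body = tok[1], tok[2]
                inner = convert(body)  # the body still sees the previous binding
                bindings[x] += 1
                name = x if bindings[x] == 1 else f"{x}^{bindings[x]}"
                current[x] = name
                out.append(name + "{" + inner + "}")
        return "·".join(out)

    delta = convert(path)
    return delta, list(selections.values())


example = [
    ("ref", "y"),
    ("bind", "x", [("re", "Σ*")]),
    ("ref", "x"),
    ("bind", "x", [("re", "a*")]),
    ("ref", "x"),
    ("ref", "x"),
]
delta, sels = path_to_core_spanner(example)
```

Each returned list corresponds to one string equality selection, and every variable name occurs exactly once in `delta`, which is the functionality property noted in the proof. The final projection *π*_{∅} is not modelled here.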

*Example 3.20*

As these spanner representations are Boolean, they are also union compatible. Hence, we can now combine Lemma 3.17 and Lemma 3.19 to observe the following.

**Theorem 3.21**

*There is an algorithm that, given**α* ∈ vsfXR, *computes*\(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}\)*with*\(\mathcal {L}(\rho )=\mathcal {L}(\alpha )\).

*x*)^{∗}.

While *L*_{imp} might seem to be an obvious witness that separates the classes of languages that are recognized by core spanners and by vstar-free xregexes, proving this appears to be quite involved. Instead, we consider a related language, which allows us to use the following tool:

**Definition 3.22**

Let \(k\in \mathbb {N}_{>0}\). We call a set \(A \subseteq \mathbb {N}^{k}\)*linear* if there exist an *r* ≥ 0 and \(m_{0},\ldots ,m_{r}\in \mathbb {N}^{k}\) with \(A=\{m_{0}+ m_{1} i_{1} + m_{2} i_{2} + {\cdots } + m_{r} i_{r} \mid i_{1},i_{2},\ldots ,i_{r}\in \mathbb {N}\}\) . A set \(A \subseteq \mathbb {N}^{k}\) is *semi-linear* if it is a finite union of linear sets. Assume *Σ* = {*a*_{1}, *a*_{2}, …, *a*_{k}} with |*Σ*| = *k*. The *Parikh map*\(\Psi \colon \Sigma ^{*}\to \mathbb {N}^{k}\) is defined by \(\Psi (w):= (|w|_{a_{1}},|w|_{a_{2}},\ldots ,|w|_{a_{k}})\), and is extended to languages by *Ψ*(*L*) := {*Ψ*(*w*)∣*w* ∈ *L*}. We call *L semi-linear* if *Ψ*(*L*) is semi-linear.
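To make the definitions concrete, the Parikh map and membership in a linear set can be computed directly. The following is a minimal sketch under the assumptions that the alphabet is given as an ordered string and that vectors are tuples; the function names are hypothetical, and the membership test is a brute-force search rather than an efficient algorithm.

```python
from itertools import product


def parikh(word, alphabet):
    """Parikh image of word: the number of occurrences of each letter."""
    return tuple(word.count(a) for a in alphabet)


def in_linear_set(v, base, periods):
    """Decide whether v = base + sum of i_j * periods[j] for some naturals i_j.

    No coefficient of a non-zero period can exceed max(v), since it would
    overshoot some coordinate, so a bounded search is sufficient.
    """
    limit = max(v, default=0)
    for coeffs in product(range(limit + 1), repeat=len(periods)):
        candidate = tuple(
            b + sum(c * p[d] for c, p in zip(coeffs, periods))
            for d, b in enumerate(base)
        )
        if candidate == v:
            return True
    return False


# The Parikh image of the regular language (ab)* is the linear set
# with base (0, 0) and the single period (1, 1).
image = parikh("ababab", "ab")
```

A semi-linear set is then a finite list of such (base, periods) pairs, and membership is the disjunction of the individual tests.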

According to Parikh’s Theorem [32], every context-free language is semi-linear. Moreover, as shown by Ginsburg and Spanier [19], a set is semi-linear if and only if it is definable in Presburger arithmetic. Building on this, we state the following.

**Theorem 3.23**

*For every**α* ∈ vsfXR, *the language*\(\mathcal {L}(\alpha )\)*is semi-linear.*

### Proof

In order to increase the readability, we prove the claim for the case |*Σ*| = 2 (the adaption to larger alphabets is obvious). We assume *Σ* = {a, b} and define *Ψ*(a) := (1, 0) and *Ψ*(b) := (0, 1). Assume that Vars (*α*) = {*x*_{1}, …, *x*_{k}} for some \(k\in \mathbb {N}_{>0}\).

It suffices to prove the claim for *α* ∈ XRP, as semi-linear sets are closed under union, and (according to Lemma 3.17) every vstar-free xregex is equivalent to a finite union of xregex paths.

As explained in the proof of Lemma 3.19 (in the construction of *γ*), we can also assume without loss of generality that every variable binding *x*{⋯} occurs exactly once in *α*, and that no variable reference &*x*_{i} uses the default binding *ε*. In particular, this means that in every *α*-parse tree, each variable *x*_{i} stores exactly one word *w*_{i}.

Let *α* be an xregex path that satisfies these conditions. Our goal is to construct a Presburger formula *φ* such that \(\varphi(n^{\mathtt {a}},n^{\mathtt {b}})\) is true if and only if \((n^{\mathtt {a}},n^{\mathtt {b}})\in \Psi (\mathcal {L}(\alpha ))\). This formula will use variables \(x^{\mathtt {a}}_{i}\) and \(x^{\mathtt {b}}_{i}\) to represent |*w*_{i}\(|_{\mathtt{a}}\) and |*w*_{i}\(|_{\mathtt{b}}\), respectively. Recall that, due to our initial assumptions, each reference &*x*_{i} refers to the same word *w*_{i}; hence, we can safely define the corresponding variables \(x^{\mathtt {a}}_{i}\) and \(x^{\mathtt {b}}_{i}\) “globally” in *φ*.

Let *I* ⊆ {1, …, *k*}. We use **x** and **x**_{I} as abbreviations for the sequences \(x^{\mathtt {a}}_{1},x^{\mathtt {b}}_{1}, \ldots \), \(x^{\mathtt {a}}_{k},x^{\mathtt {b}}_{k}\) and \(\left (x^{\mathtt {a}}_{i},x^{\mathtt {b}}_{i} : i \in I\right )\), and define

*φ*_{α} with Vars(*α*) = {*x*_{1}, …, *x*_{k}} is constructed according to the following general procedure.

Given an xregex path *γ*, we define a Presburger formula *φ*_{γ} as follows: First, as *γ* is an xregex path, there is a decomposition *γ* = *γ*_{1} ⋅ *γ*_{2}⋯*γ*_{l}(\(l\in \mathbb {N}_{>0}\)), where each *γ*_{i} is either a proper regular expression, a variable reference, or a variable binding of the form \(x\{\hat {\gamma _{i}}\}\) such that \(\hat {\gamma _{i}}\) is also an xregex path. For each *γ*_{i}, we use variables \(n^{\mathtt {a}}_{i}\) and \(n^{\mathtt {b}}_{i}\) to denote the number of a or b that occur in the subword that is generated by *γ*_{i}.

*γ*_{i} by

*x* ∈ VarsBR(*γ*_{i})).

- If *γ*_{i} is a proper regular expression, then \(\mathcal {L}(\gamma _{i})\) is semi-linear (as a consequence of Parikh’s theorem [32], every regular language is semi-linear). Hence, due to Ginsburg and Spanier [19], there is a Presburger formula \(\hat {\varphi }_{\gamma _{i}}\) such that \(\hat {\varphi }_{\gamma _{i}}(n^{\mathtt {a}}, n^{\mathtt {b}})\) is true if and only if \((n^{\mathtt {a}}, n^{\mathtt {b}})\in \Psi (\mathcal {L}(\gamma _{i}))\). We define
  $$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{\textsf{VarsBR}(\gamma_{i})}\right):= \hat{\varphi}_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i}\right). $$
  In order to avoid potential confusion, note that in this case \(\mathbf {x}_{\textsf{VarsBR}(\gamma _{i})}\) is the empty sequence. This is due to the fact that *γ*_{i} is a proper regular expression, which implies VarsBR(*γ*_{i}) = *∅*.
- If *γ*_{i} = &*x*_{j} for some 1 ≤ *j* ≤ *l*, we define
  $$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma_{i}\right)}}\right):= \left( n^{\mathtt{a}}_{i}=x^{\mathtt{a}}_{j}\right) \wedge \left( n^{\mathtt{b}}_{i}=x^{\mathtt{b}}_{j}\right). $$
- If *γ*_{i} = *x*_{j}{*δ*} for some 1 ≤ *j* ≤ *l* and some xregex path *δ*, we define
  $$\varphi_{\gamma_{i}}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{{\textsf{VarsBR}\left( \gamma_{i}\right)}}\right):= \left( n^{\mathtt{a}}_{i}=x^{\mathtt{a}}_{j}\right) \wedge \left( n^{\mathtt{b}}_{i}=x^{\mathtt{b}}_{j}\right) \wedge \varphi_{\delta}\left( n^{\mathtt{a}}_{i},n^{\mathtt{b}}_{i},\mathbf{x}_{\textsf{VarsBR}(\delta)}\right). $$

Although this definition is recursive, *φ* is still ensured to be finite and well-defined (as *δ* is always a subexpression of *γ* and, hence, shorter).

*x*_{i}, each variable reference &*x*_{i} refers to the same word *w*_{i}. Taking this into account, we can prove that

We use Theorem 3.23 to separate the classes of languages that are recognized by core spanners and by vstar-free xregexes:

**Lemma 3.24**

*Let**L*_{nsl} := {(ab^{m})^{n} ∣ *m*, *n* ≥ 2} *and*\(\rho := \zeta ^{R_{\text {com}}}_{x,y} (x\{\mathtt {a}\mathtt {b}\mathtt {b}^{+}\}y\{\Sigma ^{+}\})\)*for**Σ* := {a, b}. *Then*\(L_{\text {nsl}}=\mathcal {L}(\rho )\), *but there is no**α* ∈ vsfXR*with*\(\mathcal {L}(\alpha )=L_{\text {nsl}}\).

### Proof

Assume that there is an *α* ∈ vsfXR with \(\mathcal {L}(\alpha )=L_{\text {nsl}}\). By Theorem 3.23, *L*_{nsl} must be semi-linear. Note that *Ψ*(*L*_{nsl}) = {(*n*, *m**n*)∣*m*, *n* ≥ 2}. As semi-linear sets are closed under projection (cf. Ginsburg and Spanier [19]), this implies that the set *C* := {*m**n* ∣ *m*, *n* ≥ 2} is semi-linear, and due to closure under complementation (also cf. [19]), the set *P* = {*p* ∣ *p* is prime, *p* = 0, or *p* = 1} is semi-linear as well. However, semi-linear sets are finite unions of linear sets, and so *P* contains a subset \(P_{c,a} := \{ c+an \mid n \in \mathbb {N}_{>0} \}\) of prime numbers for *c* ≥ 2 and *a* ≥ 2. Obviously, *c* + *a**c* = *c*(1 + *a*) ∈ *P*_{c, a}, but *c*(1 + *a*) is a composite number. Hence, there is no *α* ∈ vsfXR with \(\mathcal {L}(\alpha ) = L_{\text {nsl}}\). □
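The final step of the proof is a one-line arithmetic observation: the progression *c* + *a**n* contains the composite number *c*(1 + *a*), obtained for *n* = *c*. The following small check is only a sanity test of that identity on a finite range, not a substitute for the argument; the function names are hypothetical.

```python
def is_prime(p):
    """Trial division; sufficient for the small numbers checked here."""
    return p >= 2 and all(p % d for d in range(2, int(p ** 0.5) + 1))


def composite_witness(c, a):
    """Return (n, c + a*n) for the choice n = c used in the proof."""
    n = c
    return n, c + a * n


# Every progression with c >= 2 and a >= 2 in this range contains a composite.
witnesses = {
    (c, a): composite_witness(c, a) for c in range(2, 30) for a in range(2, 30)
}
```

Since *n* = *c* ≥ 2 lies in \(\mathbb {N}_{>0}\), the witness belongs to *P*_{c, a}, and it factors as *c* ⋅ (1 + *a*) with both factors at least 2.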

We do not need the join operator to define non-semi-linear languages: Consider the core spanner representation *ρ* from Example 3.14 with \(\mathcal {L}(\rho )=L_{\text {nsl}}\). If we construct \(\hat {\rho }\) as explained below that example, we obtain \(\mathcal {L}(\hat {\rho })=\{ww\mid w\in L_{\text {nsl}}\}\), which is also not semi-linear.

It is worth pointing out that Lemma 3.24 does not resolve the open question from [7] whether there is a language that is recognized by a core spanner, but not by an xregex, as Theorem 3.23 only applies to vstar-free xregexes. We have already seen languages that are not semi-linear, but are recognized by xregexes: The language *L*_{nsl} is recognized by *α*_{nsl} := *x*{abb^{+}}&*x*^{+}; and a similar approach is used for the following language (which we already met in Example 2.4):

*Example 3.25*

Let *Σ* := {a}, and define the language *L*_{npr} := {a^{mn} ∣ *m*, *n* ≥ 2}. In other words, *L*_{npr} is the language of all words a^{i} with *i* ≥ 4 such that *i* is not a prime number. Let *α*_{npr} := *x*{aa^{+}} ⋅ (&*x*)^{+}. Then \(\mathcal {L}(\alpha _{\text {npr}})=L_{\text {npr}}\).
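The xregex *α*_{npr} has a direct counterpart in practical regular expression dialects, where a capture group plays the role of the binding *x*{⋯} and a backreference plays the role of &*x*. As an illustration (using Python's `re` module, which is an assumption of this sketch and not part of the formal model), membership in *L*_{npr} can be tested as follows.

```python
import re

# (aa+) corresponds to x{aa+}; \1+ corresponds to (&x)+.
NPR = re.compile(r"(aa+)\1+")


def in_l_npr(word):
    """True if the whole word matches the backreference pattern."""
    return NPR.fullmatch(word) is not None


def is_composite(i):
    """True if i >= 4 and i has a non-trivial divisor."""
    return i >= 4 and any(i % d == 0 for d in range(2, i))


# Compare the pattern with the arithmetic definition on short words.
agreement = all(in_l_npr("a" * i) == is_composite(i) for i in range(0, 60))
```

The match succeeds exactly when the word can be split into *n* ≥ 2 copies of a block of length *m* ≥ 2, which is the condition *i* = *m**n* from the definition of *L*_{npr}.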

While *L*_{nsl} and *L*_{npr} are defined by very similar xregexes, the latter cannot be recognized by core spanners. In order to show this with a semi-linearity argument, we observe:

**Theorem 3.26**

*Let* |*Σ*| = 1 *and let P be a core spanner over**Σ*. *Then*\(\mathcal {L}(P)\)*is semi-linear.*

### Proof

We prove this by showing that on unary terminal alphabets, every EC^{reg}-language is semi-linear. Due to Theorem 3.12, this proves the claim.

Let *Σ* = {a}, and consider any EC^{reg}-formula *φ*(*w*) over *Σ*. We show that \(\mathcal {L}(\varphi )\) is semi-linear by converting *φ* into a Presburger formula \(\hat {\varphi }\) for the set \(\Psi (\mathcal {L}(\varphi ))=\{|w|\mid w\in \mathcal {L}(\varphi )\}\). We obtain \(\hat {\varphi }\) by rewriting *φ* in the following way:

1. Each quantifier ∃*x* is replaced with \(\exists \hat {x}\).
2. Each regular constraint *L*_{A}(*x*) is replaced with a formula \(\hat {\varphi }_{A}(\hat {x})\) for the set \(\{|x|\mid x \in \mathcal {L}(A)\}\). As each \(\mathcal {L}(A)\) is a regular language, this is possible according to Ginsburg and Spanier [19].
3. Each word equation *η*_{L} = *η*_{R} is replaced with the equation sum(*η*_{L}) = sum(*η*_{R}), where the function sum is defined by sum(a) := 1, \(\text {sum}(x):= \hat {x}\) for *x* ∈ *X*, and sum(*α* ⋅ *β*) := sum(*α*) + sum(*β*).

For example, the word equation *x*a*x**y**x* = a*y**z**z**y*a is converted into the Presburger equation \(\hat {x} + 1 + \hat {x} + \hat {y}+ \hat {x} = 1 + \hat {y}+ \hat {z}+ \hat {z}+ \hat {y} + 1\) (for *Σ* = {a}). Intuitively, each variable \(\hat {x}\) in \(\hat {\varphi }\) contains the length of *x* in *φ* (which, as |*Σ*| = 1, corresponds to the Parikh image of that word). Hence, the Presburger formula \(\hat {\varphi }\) defines the set \(\Psi (\mathcal {L}(\varphi ))\). According to [19], this implies that \(\Psi (\mathcal {L}(\varphi ))\) is semi-linear, which means that \(\mathcal {L}(\varphi )\) is semi-linear. This concludes the proof. □
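The function sum from the third rewriting step can be sketched as a computation of a linear expression over the length variables. In the following illustration, a side of a word equation is assumed to be a string over single-character symbols, and a linear expression is represented as a pair of a constant and a counter of coefficients; these encodings and the function names are hypothetical.

```python
from collections import Counter


def length_sum(side, variables):
    """Return (constant, coefficients) of sum(side) over a unary alphabet."""
    const = 0
    coeffs = Counter()
    for sym in side:
        if sym in variables:
            coeffs[sym] += 1  # sum(x) contributes the length variable of x
        else:
            const += 1  # sum(a) := 1 for the terminal
    return const, coeffs


def evaluate(expr, lengths):
    """Value of a linear expression for concrete word lengths."""
    const, coeffs = expr
    return const + sum(c * lengths[v] for v, c in coeffs.items())


variables = {"x", "y", "z"}
lhs = length_sum("xaxyx", variables)   # 3*x + y + 1
rhs = length_sum("ayzzya", variables)  # 2*y + 2*z + 2
# |x| = 1, |y| = 0, |z| = 1 satisfies the length equation.
lengths = {"x": 1, "y": 0, "z": 1}
```

For the unary alphabet, an assignment of lengths determines the words themselves, so the lengths above correspond to the solution *x* = a, *y* = *ε*, *z* = a, under which both sides of the word equation equal aaaa.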

Note that this construction only applies to unary alphabets, as this is the only case where there is a one-to-one correspondence between words and their Parikh images.

Apart from the observation that *L*_{npr} from Example 3.25 is not recognized by core spanners, Theorem 3.26 also allows us to conclude the following.

**Corollary 3.27**

*If* |*Σ*| = 1, *then*\(\mathcal {L}(P)\)*is regular for every core spanner P.*

In other words, for unary terminal alphabets, core spanners recognize exactly the same class as regular spanners, namely the class of regular languages (which, in the unary case, is identical to the class of context-free languages). Furthermore, Lemma 3.24 and Theorem 3.26 together show the following.

**Corollary 3.28**

*The class of languages that is recognized by core spanners is not closed under homomorphisms.*

EC^{reg} (and, hence, also not by EC or \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\)), it remains open whether the reverse direction holds as well. Secondly, although we know that \(\mathcal {L}(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}})\subseteq \mathcal {L}({\textsf{EC}^{\textsf{reg}}})\), we do not know whether this inclusion is strict. In fact, it even remains open whether there is a language that is recognized by EC, but not by \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\). This second set of questions is discussed in more detail in Freydenberger [13].

## 4 Decision Problems

### 4.1 Spanner Evaluation

We first examine the *combined complexity* of the evaluation problem for core spanners. To this end, we define the problem CSp−Eval: Given a core spanner representation \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), a word *w* ∈ *Σ*^{∗}, and a (SVars(*ρ*), *w*)-tuple *μ*, is *μ* ∈ [[*ρ*]](*w*)? In order to prove lower bounds for this problem, we consider the membership problem for pattern languages: Given a pattern *α* and a word *w*, decide whether \(w\in \mathcal {L}(\alpha )\). As shown by Jiang et al. [24], this problem is NP-complete (for pattern languages that do not allow replacing variables with *ε*, this was already shown by Angluin [1]). Due to Theorem 3.3, we observe the following (the proof of NP-membership is straightforward).

**Theorem 4.1**

CSp−Eval*is*NP*-complete,**even if restricted to*\(\textsf{RGX}^{\{\pi ,\zeta ^=\}}\).

### Proof

In order to prove NP-hardness, it suffices to give a polynomial time reduction from the membership problem for pattern languages to CSp−Eval. Given a pattern *α* and a word *w*, we use Theorem 3.3 to construct a spanner representation \(\rho _{\alpha }\in \textsf{RGX}^{\{\zeta ^=\}}\) in polynomial time such that \(\mathcal {L}(\alpha )=\mathcal {L}(\rho _{\alpha })\). Next, we define *ρ* := *π*_{∅}*ρ*_{α}. As *ρ* represents a Boolean spanner, we define *μ* to be the empty tuple (). Now, *μ* ∈ [[*ρ*]](*w*) holds if and only if \(w\in \mathcal {L}(\alpha )\).

We prove membership in NP using the following NP-algorithm: Assume that we are given a core spanner representation *ρ*, a word *w* ∈ *Σ*^{∗}, and a *w*-tuple *μ*. For every regex formula *γ* in *ρ*, we nondeterministically guess a *w*-tuple *μ*_{γ}. By definition, each of these tuples has a size that is polynomial in |*w*|. In addition to this, for every union (*ρ*_{1} ∪ *ρ*_{2}), we guess a representation *ρ*_{i} that is ignored. We then verify these guesses deterministically: First, we discard all parts of *ρ* that are ignored, and obtain a spanner representation \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\bowtie \}}\). For all remaining regex formulas *γ* in \(\hat {\rho }\), we check whether *μ*_{γ} is consistent with *γ* and *w*. Obviously, this can be done in polynomial time. If all of these checks pass, we evaluate all operators in \(\hat {\rho }\). As \(\hat {\rho }\) contains no unions, the result of these evaluations is always either *∅*, or a set that contains exactly one *w*-tuple. Hence, this process only takes polynomial time. Furthermore, when it terminates, it results either in *∅*, or in a *w*-tuple \(\hat {\mu }\). In the latter case, we return True if \(\hat {\mu }=\mu \). □

The question arises whether there are natural restrictions to CSp−Eval that make this problem tractable. It appears that any subclass of the core spanners that extends regular spanners in a meaningful way while having a tractable evaluation problem cannot be allowed to recognize the full class of pattern languages.

For pattern languages, it was shown by Ibarra et al. [23] that bounding the number of variables in the pattern leads to an algorithm for the membership problem whose running time is polynomial, but only in \(\mathcal {O}(n^{k})\) (where *n* is the length of the word *w*, and *k* the number of variables). From a parameterized complexity point of view (see e. g. Grohe and Flum [20]), this is usually not considered satisfactory. Without going too much into details, in parameterized complexity, one generally considers a parameterized problem tractable if it belongs to the class FPT (from *fixed-parameter tractable*). This class is defined as follows: The input of a parameterized problem is a pair (*x*, *k*), where *x* is the input of the non-parameterized problem (e. g., a pattern *α* and a word *w*), and *k* is a parameter of the input (e. g., the number of variables in *α*). The parameterized problem is in FPT if there exist a computable function *f*, a constant *c* ≥ 0, and an algorithm that decides the problem in time *O*(*f*(*k*)*n*^{c}), where *n* is the size of *x*. We do not define the class W[1], but we note that the standard complexity theoretic assumption is that if a problem is W[1]-hard, it is not in FPT.

It was first observed by Stephan et al. [34] that the membership problem for pattern languages is W[1]-complete if the number of variable occurrences (not of variables) is used as a parameter (see Fernau et al. [11] for the full proof). As the number of variable occurrences in a pattern corresponds to the number of variables in an equivalent spanner, this implies that using the number of variables in a spanner as parameter leads to W[1]-hardness for this parameter of CSp−Eval.

Fernau and Schmid [10] and Fernau et al. [11] discuss these and various other potential restrictions to pattern languages that still do not lead to tractability (among these a bound on the length of the replacement of each variable, which corresponds to a bound on the length of spans). On the other hand, Reidenbach and Schmid [33] and Fernau et al. [9] examine parameters for patterns that make the membership problem tractable. While this does not directly translate to spanners, the authors consider these directions promising for further research.

But apart from these potential restrictions on the use of string equality, other restrictions are needed, as the use of join also makes evaluation intractable:

**Proposition 4.2**

CSp−Eval *is* NP*-complete, even if restricted to* RGX^{{π,⋈}}.

### Proof

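Membership in NP follows from Theorem 4.1. For NP-hardness, we reduce from the clique problem: Given an undirected graph *G* = (*V*, *E*) and a number *k* ≤ |*V*|, decide whether *G* contains a clique of size *k*. This problem is NP-complete (cf. Garey and Johnson [18]). Consider an undirected graph *G* = (*V*, *E*) with *V* = {1, …, *n*} for some *n* ≥ 1, and a number *k* ≤ *n*. Let a ∈ *Σ* and define *w* := a^{n} and *ρ* := ⋈_{1≤i<j≤k} *α*_{i,j}, where each *α*_{i,j} is defined by

\[\alpha _{i,j} := \bigvee _{\{u,v\}\in E,\ u<v} \Bigl( \bigl(\mathtt {a}^{u-1}\cdot x_{i}\{\mathtt {a}\}\cdot \mathtt {a}^{v-u-1}\cdot x_{j}\{\mathtt {a}\}\cdot \mathtt {a}^{n-v}\bigr) \vee \bigl(\mathtt {a}^{u-1}\cdot x_{j}\{\mathtt {a}\}\cdot \mathtt {a}^{v-u-1}\cdot x_{i}\{\mathtt {a}\}\cdot \mathtt {a}^{n-v}\bigr) \Bigr).\]

Intuitively, each *α*_{i,j} selects an edge between two nodes *u* and *v*, which allows [[*α*_{i,j}]](*w*) to map *x*_{i} to the *u*-th and *x*_{j} to the *v*-th letter of *w*. Then *μ* ∈ [[*ρ*]](*w*) holds if and only if there exist distinct nodes *v*_{1}, …, *v*_{k} ∈ *V* such that {*v*_{i}, *v*_{j}} ∈ *E* for all 1 ≤ *i* < *j* ≤ *k*; and *μ*(*x*_{i}) = [*v*_{i}, *v*_{i} + 1〉 for 1 ≤ *i* ≤ *k*. Thus, the empty tuple is an element of [[*π*_{∅}*ρ*]](*w*) if and only if *G* contains a clique of size *k*. □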
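The reduction in this proof can be simulated directly on relations. In the sketch below, each tuple of [[*α*_{i,j}]](*w*) is modelled as a dictionary that maps the variable indices *i* and *j* to the two endpoints of an edge (that is, to positions of *w* = a^{n}), and *ρ* is evaluated by joining these relations. The function names and this encoding are assumptions made for illustration; the naive join can produce exponentially large intermediate results, which is in line with the hardness result.

```python
from itertools import combinations

def alpha_relation(i, j, edges):
    """Tuples of [[alpha_{i,j}]](a^n): x_i and x_j sit on the endpoints of an edge."""
    rel = []
    for u, v in edges:
        rel.append({i: u, j: v})
        rel.append({i: v, j: u})
    return rel

def join(r1, r2):
    """Natural join: keep pairs of tuples that agree on their shared variables."""
    out = []
    for t1 in r1:
        for t2 in r2:
            if all(t1[x] == t2[x] for x in t1.keys() & t2.keys()):
                out.append({**t1, **t2})
    return out

def has_clique(n, edges, k):
    """Evaluate the join of all alpha_{i,j} with 1 <= i < j <= k on w = a^n."""
    if k <= 1:
        # Degenerate case without any alpha_{i,j}: a single node suffices.
        return n >= k
    result = [{}]
    for i, j in combinations(range(1, k + 1), 2):
        result = join(result, alpha_relation(i, j, edges))
    # A surviving tuple assigns k pairwise adjacent (hence distinct) nodes.
    return len(result) > 0
```

Each surviving tuple corresponds to one way of placing *x*_{1}, …, *x*_{k} on the nodes of a clique.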

We also consider the *data complexity* of the evaluation problem for core spanners. For every core spanner representation *ρ* over *Σ*, we define the decision problem CSp−Eval(*ρ*): Given a word *w* ∈ *Σ*^{∗} and a *w*-tuple *μ*, is *μ* ∈ [[*ρ*]](*w*)? Using a slight variation of the proof of Theorem 4.1, we obtain the following.

**Theorem 4.3**

CSp−Eval(*ρ*) *is in* NLOGSPACE *for every* \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\).

### Proof

This result follows from a slight change to the NP-decision procedure from the proof of Theorem 4.1: We can represent the guessed *w*-tuples *μ*_{γ} for each regex formula *γ* by using two pointers for each *μ*_{γ}(*x*) = [*i*, *j*〉 (one pointer for *i*, one for *j*). As *ρ* is fixed, a finite number of such pointers suffices to represent all *w*-tuples. Furthermore, the verification of these guesses can also be realized nondeterministically with only a constant amount of additional pointers. □

### 4.2 Static Analysis

We consider the following decision problems for core spanner representations *ρ*, *ρ*_{1}, and *ρ*_{2}:

1. CSp−Sat: Is [[*ρ*]](*w*) ≠ *∅* for some *w* ∈ *Σ*^{∗}?
2. CSp−Hierarchicality: Is [[*ρ*]] hierarchical?
3. CSp−Universality: Is [[*ρ*]] = Υ_{SVars(ρ)}?
4. CSp−Equivalence: Is [[*ρ*_{1}]] = [[*ρ*_{2}]]?
5. CSp−Containment: Is [[*ρ*_{1}]] ⊆ [[*ρ*_{2}]]?
6. CSp−Regularity: Is [[*ρ*]] ∈ [[RGX^{{π, ∪, ⋈}}]]?

By Theorem 3.12, core spanner representations can be converted into EC^{reg}-formulas, for which satisfiability is in PSPACE (cf. Diekert [6]). Hence, we observe:

**Theorem 4.4**

*The problem* CSp−Sat *is* PSPACE*-complete, even if it is restricted to spanner representations from* \(\textsf{RGX}^{\{\zeta ^=\}}\).

### Proof

We begin with the upper bound. According to Theorem 3.12, for every core spanner representation *ρ*, there exists an EC^{reg}-formula *φ* that realizes [[*ρ*]]. Furthermore, *φ* can be computed in polynomial time. In particular, *φ* is satisfiable if and only if *ρ* is satisfiable. As satisfiability for EC^{reg}-formulas is in PSPACE (cf. Diekert [6]), this question can be answered in PSPACE.

For the lower bound, we reduce from the intersection emptiness problem for regular expressions: Given regular expressions *α*_{1}, …, *α*_{n}, decide whether \(\bigcap _{i=1}^{n}\mathcal {L}(\alpha _{i})=\emptyset \). As a direct consequence of the proof of Lemma 3.2.3 in Kozen [27], this problem is PSPACE-complete (although Kozen’s proof uses automata, these are defined via regular expressions). Recall that every proper regular expression is also a functional regex formula. Hence, we can construct a spanner representation

\[\rho := \zeta ^=_{x_{1},\ldots ,x_{n}}\bigl (x_{1}\{\alpha _{1}\}\cdots x_{n}\{\alpha _{n}\}\bigr )\]

and let *P* := [[*ρ*]]. For every *w* ∈ *Σ*^{∗}, we have *P*(*w*) ≠ *∅* if and only if there exists a word *v* ∈ *Σ*^{∗} with *w* = *v*^{n} and \(v\in \mathcal {L}(\alpha _{i})\) for 1 ≤ *i* ≤ *n*. Hence, *P* is satisfiable if and only if \(\bigcap _{i=1}^{n}\mathcal {L}(\alpha _{i})\neq \emptyset \). As PSPACE is closed under complementation, this proves PSPACE-hardness of CSp−Sat, even when restricted to representations from the class \(\textsf{RGX}^{\{\zeta ^=\}}\). □
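The characterization used in this lower bound (a word *w* is accepted if and only if *w* = *v*^{n} for some *v* that lies in every \(\mathcal {L}(\alpha _{i})\)) is easy to test for a single word. The following sketch does so under two assumptions that are not part of the paper: the regular expressions are written in the syntax of Python's `re` module, and the function name `selected` is hypothetical.

```python
import re

def selected(w, regexes):
    """Check whether w = v^n for a word v that matches all n given regexes.

    This mirrors the condition "P(w) is non-empty" from the proof, where n is
    the number of regular expressions and each block of w is bound to one x_i.
    """
    n = len(regexes)
    if n == 0 or len(w) % n != 0:
        return False
    block = len(w) // n
    v = w[:block]
    if w != v * n:
        # The string equality selection forces all n blocks to be equal.
        return False
    return all(re.fullmatch(r, v) is not None for r in regexes)
```

Deciding whether some such *w* exists at all is the PSPACE-hard part; the sketch only verifies a given candidate.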

The proof of the lower bound in Theorem 4.4 uses the PSPACE-hardness of the intersection emptiness problem for regular expressions. But even if the variables in the regex formulas were only bound to *Σ*^{∗}, it follows from Theorem 3.13 that this problem would still be at least as hard as the satisfiability problem for word equations without constraints. Considering that even proving the decidability was hard (see Diekert [6] for an overview), approaching CSp−Sat without knowledge on word equations would have required enormous additional effort.

It is also possible to use EC^{reg}-formulas to express a violation of the criteria for hierarchicality. This allows us to state the following result:

**Theorem 4.5**

*The problem* CSp−Hierarchicality *is* PSPACE*-complete, even if it is restricted to* \(\textsf{RGX}^{\{\zeta ^=,\times \}}\).

### Proof

We begin with the upper bound. The main idea is that non-hierarchicality can be expressed in EC^{reg}-formulas. Hence, our goal is to construct a polynomial time procedure that, given a core spanner representation \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), constructs an EC^{reg}-formula *φ*_{NH} that is satisfiable if and only if [[*ρ*]] is not hierarchical.

Recall that, for every spanner *P* and every word *w* ∈ *Σ*^{∗}, a *w*-tuple *μ* ∈ *P*(*w*) is not hierarchical if there exist variables *x*, *y* ∈ SVars(*P*) such that all of the following hold:

1. the span *μ*(*x*) does not contain *μ*(*y*),
2. the span *μ*(*y*) does not contain *μ*(*x*), and
3. the spans *μ*(*x*) and *μ*(*y*) overlap (i. e., they are not disjoint).

In this case, we say that *μ*(*x*) and *μ*(*y*) *strictly overlap*. It is easy to see that two spans [*i*_{1}, *j*_{1}〉 and [*i*_{2}, *j*_{2}〉 strictly overlap if one of the following *strict overlap conditions* is met:

1. *i*_{1} < *i*_{2} < *j*_{1} < *j*_{2},
2. *i*_{2} < *i*_{1} < *j*_{2} < *j*_{1}.
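The relationship between the three defining properties and the two strict overlap conditions can be checked mechanically on small examples. The sketch below represents a span [*i*, *j*〉 as a pair `(i, j)` and assumes the usual notions of containment and disjointness of spans; all function names are chosen for this illustration only.

```python
from itertools import combinations

def contains(s, t):
    """Span s contains span t."""
    return s[0] <= t[0] and t[1] <= s[1]

def disjoint(s, t):
    """Spans s and t share no position."""
    return s[1] <= t[0] or t[1] <= s[0]

def strictly_overlap(s, t):
    """Neither span contains the other, and they are not disjoint."""
    return not contains(s, t) and not contains(t, s) and not disjoint(s, t)

def strict_overlap_conditions(s, t):
    """The two index conditions on [i1, j1> and [i2, j2>."""
    (i1, j1), (i2, j2) = s, t
    return i1 < i2 < j1 < j2 or i2 < i1 < j2 < j1

def is_hierarchical(mu):
    """A tuple (dict from variables to spans) without strictly overlapping spans."""
    return not any(strictly_overlap(mu[x], mu[y]) for x, y in combinations(mu, 2))
```

For instance, the tuple that maps *x* to [1, 3〉 and *y* to [2, 4〉 is not hierarchical, whereas mapping *x* to [1, 4〉 and *y* to [2, 3〉 yields a hierarchical tuple.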

We now define an EC^{reg}-formula *φ*_{ovl}(*x*^{P}, *x*^{C}, *y*^{P}, *y*^{C}) that expresses the first condition when combined with an EC^{reg}-formula that realizes a spanner (we do not need to define a formula for the second condition, as both conditions are symmetrical). To this purpose, we first define the EC^{reg}-formula

\[\varphi _{\text {ppref}}(x,y) := \exists z\colon \bigl ((y = x\cdot z)\wedge z\in \mathcal {L}(A)\bigr ),\]

where *A* is an NFA with \(\mathcal {L}(A)=\Sigma ^{+}\). Clearly, (*x*, *y*) ∈ *Σ*^{∗} × *Σ*^{∗} satisfies *φ*_{ppref} if and only if *x* is a proper prefix of *y*. Next, we define

\[\varphi _{\text {ovl}}(x^{P},x^{C},y^{P},y^{C}) := \exists \hat {x},\hat {y}\colon \bigl ((\hat {x}=x^{P}x^{C})\wedge (\hat {y}=y^{P}y^{C})\wedge \varphi _{\text {ppref}}(x^{P},y^{P})\wedge \varphi _{\text {ppref}}(y^{P},\hat {x})\wedge \varphi _{\text {ppref}}(\hat {x},\hat {y})\bigr ).\]

Recall that *φ*_{ovl} is used in combination with an EC^{reg}-formula that realizes a spanner. Hence, *x*^{P} and *x*^{C} represent a span [1 + |*x*^{P}|, 1 + |*x*^{P}*x*^{C}|〉 = [*i*_{1}, *j*_{1}〉, while *y*^{P} and *y*^{C} represent a span [1 + |*y*^{P}|, 1 + |*y*^{P}*y*^{C}|〉 = [*i*_{2}, *j*_{2}〉. In particular, *x*^{P}*x*^{C} and *y*^{P}*y*^{C} are both prefixes of some common word *w*. Hence, *i*_{1} < *i*_{2} holds if and only if *x*^{P} is a proper prefix of *y*^{P}. Likewise, *i*_{2} < *j*_{1} and *j*_{1} < *j*_{2} hold if and only if *y*^{P} is a proper prefix of *x*^{P}*x*^{C}, or *x*^{P}*x*^{C} is a proper prefix of *y*^{P}*y*^{C}, respectively.

In other words, *φ*_{ovl} checks whether the first of the two strict overlap conditions is satisfied.
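The translation of the index condition *i*_{1} < *i*_{2} < *j*_{1} < *j*_{2} into proper-prefix tests can be replayed on concrete words. In the following sketch, a span [*i*, *j*〉 of *w* is encoded by its prefix word *w*_{[1,i〉} and its content word *w*_{[i,j〉}, as in the discussion above; the helper names `encode`, `proper_prefix`, and `ovl` are assumptions made for this illustration.

```python
def encode(w, span):
    """Return (prefix, content) for the span [i, j> of w (1-based positions)."""
    i, j = span
    return w[:i - 1], w[i - 1:j - 1]

def proper_prefix(x, y):
    """x is a prefix of y and strictly shorter than y."""
    return len(x) < len(y) and y.startswith(x)

def ovl(x_p, x_c, y_p, y_c):
    """The three proper-prefix tests that express i1 < i2 < j1 < j2."""
    return (proper_prefix(x_p, y_p)
            and proper_prefix(y_p, x_p + x_c)
            and proper_prefix(x_p + x_c, y_p + y_c))
```

As all four words are derived from the same *w*, comparing prefixes amounts to comparing their lengths, which is why the three tests coincide with the three inequalities.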

We are now ready to define *φ*_{NH}. Let \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), and assume that SVars(*ρ*) = {*x*_{1}, …, *x*_{n}} for some *n* ≥ 2 (spanners with fewer than two variables are trivially hierarchical). Using Theorem 3.12, we then construct an EC^{reg}-formula \(\varphi _{\rho }(x_{w}, {x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}})\) that realizes [[*ρ*]]. We now define

\[\varphi _{\text {NH}} := \varphi _{\rho }(x_{w}, {x^{P}_{1}}, {x^{C}_{1}}, \ldots , {x^{P}_{n}}, {x^{C}_{n}}) \wedge \bigvee _{1\leq l,m\leq n,\ l\neq m}\varphi _{\text {ovl}}({x^{P}_{l}},{x^{C}_{l}},{x^{P}_{m}},{x^{C}_{m}}).\]

Assume that [[*ρ*]] is not hierarchical. Then there exist a word *w* ∈ *Σ*^{∗}, a *w*-tuple *μ* ∈ [[*ρ*]](*w*), and *x*_{l}, *x*_{m} ∈ SVars(*ρ*) such that *μ*(*x*_{l}) and *μ*(*x*_{m}) strictly overlap. As *φ*_{ρ} realizes [[*ρ*]], we have that *μ* defines an assignment \((w, w_{[1,i_{1}\rangle }, w_{[i_{1},j_{1}\rangle }, \ldots , w_{[1,i_{n}\rangle }, w_{[i_{n},j_{n}\rangle })\) that satisfies this subformula (where [*i*_{k}, *j*_{k}〉 = *μ*(*x*_{k})). Furthermore, as *μ*(*x*_{m}) and *μ*(*x*_{l}) strictly overlap, either \(\varphi _{\text {ovl}}({x^{P}_{l}},{x^{C}_{l}},{x^{P}_{m}},{x^{C}_{m}})\) or \(\varphi _{\text {ovl}}({x^{P}_{m}},{x^{C}_{m}},{x^{P}_{l}},{x^{C}_{l}})\) is satisfied (if *i*_{l} < *i*_{m} or *i*_{m} < *i*_{l}, respectively). Hence, *φ*_{NH} is satisfiable.

Likewise, *φ*_{NH} is only satisfied if *φ*_{ρ} and (at least) one \(\varphi _{\text {ovl}}({x^{P}_{l}},{x^{C}_{l}},{x^{P}_{m}},{x^{C}_{m}})\) are satisfied. This corresponds to a *w*-tuple *μ* where *μ*(*x*_{l}) and *μ*(*x*_{m}) strictly overlap. Hence, *μ* is not hierarchical, which means that [[*ρ*]] is not hierarchical.

Therefore, *φ*_{NH} is satisfiable if and only if [[*ρ*]] is not hierarchical. Furthermore, *φ*_{NH} can be constructed in polynomial time, as we only need to construct *φ*_{ρ} (which is possible in polynomial time, according to the proof of Theorem 4.4), and a number of *φ*_{ovl}-formulas that is quadratic in |SVars(*ρ*)|, each of which has a constant length. Both constructions rely solely on the syntax of *ρ*, and require no further computation.

As satisfiability of EC^{reg}-formulas can be decided in PSPACE, the complement of CSp−Hierarchicality is in PSPACE; and as PSPACE is closed under complementation, this means that CSp−Hierarchicality is in PSPACE.

For the lower bound, we adapt the reduction from the proof of Theorem 4.4. Given regular expressions *α*_{1}, …, *α*_{n}, we define

\[\rho := \zeta ^=_{x_{1},\ldots ,x_{n}}\bigl (x_{1}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{1}\}\cdots x_{n}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{n}\}\bigr )\times \bigl (y\{\Sigma ^{*}\}\cdot \Sigma \bigr )\times \bigl (\Sigma \cdot z\{\Sigma ^{*}\}\bigr ),\]

where a ∈ *Σ*. By replacing each *α*_{i} in that proof with aaa ⋅ *α*_{i}, we ensure that every word *w* ∈ *Σ*^{∗} with \([{\kern -2.3pt}[ \zeta ^=_{x_{1},\ldots ,x_{n}}(x_{1}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{1}\}\cdots x_{n}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{n}\}) ]{\kern -2.3pt}](w)\neq \emptyset \) has at least length 3 (which is the minimal word length for which non-hierarchical spanners are possible). Furthermore, for each such *w*, the variable *y* is assigned the span that contains all positions of *w* except the last one, and *z* is assigned the span that contains all positions except the first one. Hence, these spans strictly overlap, which means that *ρ* is not hierarchical. On the other hand, if \([{\kern -2.3pt}[ \zeta ^=_{x_{1},\ldots ,x_{n}}(x_{1}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{1}\}\cdots x_{n}\{\mathtt {a}\mathtt {a}\mathtt {a}\cdot \alpha _{n}\}) ]{\kern -2.3pt}](w)=\emptyset \), then [[*ρ*]](*w*) = *∅*. Therefore, *ρ* is hierarchical if and only if there is no \(w\in \bigcap _{1\leq i\leq n}\mathcal {L}(\alpha _{i})\). As this problem is PSPACE-complete, CSp−Hierarchicality is PSPACE-hard. □

For the remaining problems, we use Theorem 3.21, and the fact that the undecidability results from Freydenberger [12] also hold for vstar-free xregexes:

**Theorem 4.6**

*The problems* CSp−Universality *and* CSp−Equivalence *are not semi-decidable, but co-semi-decidable. The problem* CSp−Regularity *is neither semi-decidable, nor co-semi-decidable. These results hold even if the input is restricted to* \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}\).

### Proof

The co-semi-decidability of the first two problems is obvious. We discuss this for universality: For any core spanner representation *ρ*, we can always decide whether [[*ρ*]](*w*) = Υ_{SVars(ρ)}(*w*) holds. Hence, we can semi-decide non-universality by enumerating all *w* ∈ *Σ*^{∗} until we find a word *w* with [[*ρ*]](*w*) ≠ Υ_{SVars(ρ)}(*w*). Thus, CSp−Universality is co-semi-decidable. The proof for CSp−Equivalence works analogously.
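The enumeration argument for co-semi-decidability can be sketched as a search for a counterexample word. In the sketch below, spanners are modelled as Python functions from words to sets of tuples, which is an assumption made purely for illustration; the optional `max_words` cut-off is likewise not part of the argument and only keeps the example finite, since the actual procedure does not terminate when no counterexample exists.

```python
from itertools import count, product

def words(alphabet):
    """Enumerate all words over the alphabet in length-lexicographic order."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def find_counterexample(spanner1, spanner2, alphabet, max_words=None):
    """Return the first word on which the two spanners differ.

    Without max_words, this loops forever if the spanners are equivalent,
    which is exactly the behaviour of a semi-decision procedure for
    non-equivalence.
    """
    for idx, w in enumerate(words(alphabet)):
        if max_words is not None and idx >= max_words:
            return None
        if spanner1(w) != spanner2(w):
            return w
```

For universality, `spanner1` plays the role of [[*ρ*]] and `spanner2` the role of Υ_{SVars(ρ)}.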

As shown by Freydenberger [12], if |*Σ*| ≥ 2, the following holds for xregexes *α*:

- It is not semi-decidable whether \(\mathcal {L}(\alpha )=\Sigma ^{*}\).
- It is neither semi-decidable, nor co-semi-decidable whether \(\mathcal {L}(\alpha )\) is a regular language.

More precisely, the proofs in [12] construct, for every Turing machine \(\mathcal {X}\), an xregex \(\alpha _{\mathcal {X}}\) with a single variable *x* such that \(\mathcal {L}(\alpha _{\mathcal {X}})=\Sigma ^{*}\) if and only if \(\mathcal {X}\) accepts no input, and \(\mathcal {L}(\alpha _{\mathcal {X}})\) is regular if and only if \(\mathcal {X}\) accepts only finitely many inputs.

These xregexes are defined over *Σ* = {0, *#*} and, when adapted to the notation of this paper, always have the same shape: apart from the single variable binding *x*{0^{∗}}, they are built from proper regular expressions (including *α*_{var}) and from xregex paths *α*_{i} that do not contain variable bindings, and no other variable references than &*x*.

We note that the single variable binding *x*{0^{∗}} and all variable references &*x* do not occur under a Kleene star, and conclude that \(\alpha _{\mathcal {X}}\) is a vstar-free xregex.

By Theorem 3.21, we can effectively convert every \(\alpha _{\mathcal {X}}\) into a Boolean spanner representation \(\rho _{\mathcal {X}}\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup \}}\) with \(\mathcal {L}(\rho _{\mathcal {X}})=\mathcal {L}(\alpha _{\mathcal {X}})\).

Then \([{\kern -2.3pt}[ \rho _{\mathcal {X}} ]{\kern -2.3pt}]={\Upsilon }_{\emptyset }\) holds if and only if \(\mathcal {L}(\alpha _{\mathcal {X}})=\Sigma ^{*}\). As this question is not semi-decidable, CSp−Universality is also not semi-decidable. As CSp−Universality is a special case of CSp−Equivalence, the latter problem is also not semi-decidable.

Furthermore, \([{\kern -2.3pt}[ \rho _{\mathcal {X}} ]{\kern -2.3pt}]\) is a regular spanner if and only if \(\mathcal {L}(\alpha _{\mathcal {X}})\) is a regular language (as shown by Fagin et al. [7], when viewed as language definition mechanisms, regular spanners define exactly the class of regular languages). This question is neither semi-decidable, nor co-semi-decidable; hence, this applies to CSp−Regularity as well. □

As the proof of Theorem 4.6 relies only on Boolean spanners, the decidability status of CSp−Regularity does not change if the problem asks for hierarchical regularity (i. e., membership in [[RGX]]) instead of regularity, as the two classes coincide for Boolean spanners. Likewise, CSp−Universality remains not semi-decidable if one replaces Υ_{SVars(ρ)} with \({\Upsilon }^H_{\textsf{SVars}(\rho )}\).

In the construction from this proof, variables are only bound to a language *a*^{+}. Hence, the same undecidability results hold for spanners that use selections by equal length relation, instead of the string equality relation. While the proof builds on xregexes \(\alpha _{\mathcal {X}}\) that use only a single variable *x*, the resulting core spanners use an unbounded amount of variables, as every occurrence of a variable reference &*x* in an xregex path is converted to a spanner variable *x*_{i}. But undecidability remains even if we bound the number of variables in the spanners, as the \(\alpha _{\mathcal {X}}\) can be modified to use only a bounded number of variable references (see Section 4.1 in [12]). Theorem 4.6 also implies that CSp−Containment is not semi-decidable. This holds even for a more restricted class of spanners:

**Theorem 4.7**

*The problem* CSp−Containment *is not semi-decidable, even if it is restricted to* \(\textsf{RGX}^{\{\pi ,\zeta ^=\}}\).

### Proof

This proof uses the undecidability of the inclusion problem for pattern languages, which is defined as follows: Given two patterns *α* and *β*, decide whether \(\mathcal {L}(\alpha )\subseteq \mathcal {L}(\beta )\).

For unbounded sizes of *Σ*, this undecidability was proven by Jiang et al. [25], and Freydenberger and Reidenbach [15] adapted this proof to all (non-unary) finite terminal alphabets.

Given two patterns *α*, *β*, we can use Theorem 3.3 to construct spanner representations \(\rho _{\alpha },\rho _{\beta }\in \textsf{RGX}^{\{\zeta ^=\}}\) with \(\mathcal {L}(\rho _{X})=\mathcal {L}(X)\) for *X* ∈ {*α*, *β*}, and turn these into representations of Boolean spanners \(\hat {\rho }_{X}:=\pi _{\emptyset }\rho _{X}\). Then \([{\kern -2.3pt}[ \hat {\rho }_{\alpha } ]{\kern -2.3pt}](w)\subseteq [{\kern -2.3pt}[ \hat {\rho }_{\beta } ]{\kern -2.3pt}](w)\) holds for all *w* ∈ *Σ*^{∗} if and only if \(\mathcal {L}(\alpha )\subseteq \mathcal {L}(\beta )\).

This shows that CSp−Containment is not decidable. As it is obviously co-semi-decidable, this also shows that CSp−Containment is not semi-decidable. □

As shown by Bremer and Freydenberger [4], the inclusion problem for pattern languages remains undecidable if the number of variables in the patterns is bounded. In fact, that proof constructs patterns where even the number of variable occurrences is bounded. Therefore, CSp−Containment is not semi-decidable even if restricted to representations from \(\textsf{RGX}^{\{\pi ,\zeta ^=\}}\) with a bounded number of variables. It is a hard open question whether the equivalence problem for pattern languages is decidable (cf. Ohlebusch and Ukkonen [31], Freydenberger and Reidenbach [15]). Undecidability of this problem would imply undecidability of CSp−Equivalence, even if restricted to representations from \(\textsf{RGX}^{\{\pi ,\zeta ^=\}}\).

| Problem | Status | Reference |
|---|---|---|
| CSp−Eval | NP-complete | Theorem 4.1, Proposition 4.2 |
| CSp−Eval(*ρ*) | in NLOGSPACE | Theorem 4.3 |
| CSp−Sat | PSPACE-complete | Theorem 4.4 |
| CSp−Hierarchicality | PSPACE-complete | Theorem 4.5 |
| CSp−Universality | co-semi-decidable, not semi-decidable | Theorem 4.6 |
| CSp−Equivalence | co-semi-decidable, not semi-decidable | Theorem 4.6 |
| CSp−Containment | co-semi-decidable, not semi-decidable | Theorem 4.7 |
| CSp−Regularity | neither semi-, nor co-semi-decidable | Theorem 4.6 |

Details under which restrictions the lower bounds persist can be found in the respective results.

#### 4.2.1 Minimization and Relative Succinctness

In order to address the minimization of spanner representations, we first formalize the notion of the size or complexity of a spanner representation. Even for proper regular expressions, there are various different definitions of size, see e. g. Holzer and Kutrib [22], and there might be convincing reasons to add additional weight to the number of variables or other parameters. As we shall see, these distinctions do not affect the negative results that we prove later. Hence, instead of defining a single fixed notion of size, we use the following general definition of complexity measures from Kutrib [29]:

**Definition 4.8**

Let SR be a class of spanner representations. A complexity measure for SR is a recursive function \(c\colon \textsf{SR}\to \mathbb {N}\) such that for each *Σ*, the set of all *ρ* ∈ SR that represent spanners over *Σ* can be effectively enumerated in order of increasing *c*(*ρ*), and does not contain infinitely many *ρ* ∈ SR with the same value *c*(*ρ*).

By *recursive*, we mean a function that is total and computable. Definition 4.8 is general enough to include all notions of complexity that take into account that descriptions are commonly encoded with a finite number of distinct symbols, and that it should be decidable if a word over these symbols is a valid encoding from SR. Regardless of the chosen complexity measure, computable minimization of core spanners is impossible:

**Theorem 4.9**

*Let c be a complexity measure for* \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\). *There is no algorithm that, given a* \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), *computes an equivalent* \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) *that is c-minimal.*

### Proof

Define U_{min} to be the set of c-minimal core spanner representations of Υ_{∅}. By the definition of a complexity measure, U_{min} is finite. Hence, given a core spanner representation *ρ*, we can decide whether *ρ* ∈ U_{min}.

Now assume there is an algorithm MIN_{c} that minimizes core spanner representations with respect to *c*. Given a core spanner representation *ρ*, we can then decide whether [[*ρ*]] = Υ_{∅}, by checking whether MIN_{c}(*ρ*) ∈ U_{min}. But as shown in Theorem 4.6, this problem is undecidable. Hence, MIN_{c} cannot exist. □

In addition to regex formulas, Fagin et al. [7] also define spanner representations that are based on so-called vset- and vstk-automata (denoted by VA_{set} and VA_{stk}). They show [[VA_{stk}]] = [[RGX]] and [[VA_{set}]] = [[RGX^{{π, ∪, ⋈}}]], and conclude that \([{\kern -2.3pt}[ \textsf{VA}_{\textsf{set}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \textsf{VA}_{\textsf{stk}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \textsf{RGX}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}} ]{\kern -2.3pt}]\). Without going further into details, we note that their equivalence proofs use computable conversions between the models. Hence, Theorem 4.9 also applies to those spanner representations from [7] that can express core spanners, like \(\textsf{VA}_{\textsf{stk}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}}\) and \(\textsf{VA}_{\textsf{set}}^{{\{\pi ,\zeta ^=,\cup ,\bowtie \}}}\), and it implies that an algorithm that converts from one of these classes of representations to another cannot guarantee that its result is minimal.

Using a technique by Hartmanis [21], we can use the fact that CSp−Regularity is not co-semi-decidable to compare the relative succinctness of regular and core spanner representations:

**Theorem 4.10**

*Let* *c*_{1} *and* *c*_{2} *be complexity measures for the classes* RGX^{{π, ∪, ⋈}} *and* \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), *respectively. For every recursive function* \(f\colon \mathbb {N}\to \mathbb {N}\), *there exists a* \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) *such that* [[*ρ*]] ∈ [[RGX^{{π, ∪, ⋈}}]], *but* \(c_{1}(\hat {\rho })>f(c_{2}(\rho ))\) *holds for every* \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) *with* \([{\kern -2.3pt}[{\hat {\rho }}]{\kern -2.3pt}]=[{\kern -2.3pt}[\rho ]{\kern -2.3pt}]\).

### Proof

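Assume to the contrary that there exist complexity measures *c*_{1} for RGX^{{π, ∪, ⋈}} and *c*_{2} for \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), as well as a recursive function *f* such that, for every core spanner representation \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with [[*ρ*]] ∈ [[RGX^{{π, ∪, ⋈}}]], there exists a regular spanner representation \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\) and \(c_{1}(\hat {\rho })\leq f(c_{2}(\rho ))\). Our goal is to show that this implies that the set

\[\textsf{NR} := \bigl \{\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}} \mid [{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\notin [{\kern -2.3pt}[ \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}} ]{\kern -2.3pt}]\bigr \}\]

is semi-decidable. Given a \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\), we compute *n* := *f*(*c*_{2}(*ρ*)). We define

\[F_{n} := \bigl \{\hat {\rho }\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}} \mid c_{1}(\hat {\rho })\leq n\bigr \}.\]

By the definition of complexity measures, *F*_{n} is finite, and we can effectively enumerate its elements *ρ*_{1}, …, *ρ*_{k} for *k* := |*F*_{n}|.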

Also by definition, we know that if there exists a *ρ*_{R} ∈ RGX^{{π, ∪, ⋈}} with [[*ρ*_{R}]] = [[*ρ*]], there exists a \(\hat {\rho }_{R}\in \textsf{RGX}^{\{\pi ,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[ \hat {\rho }_{R} ]{\kern -2.3pt}]=[{\kern -2.3pt}[ \rho ]{\kern -2.3pt}]\) and \(\hat {\rho }_{R}\in F_{n}\). In other words: If [[*ρ*]] is expressible with regular spanners, it is expressible with a regular spanner representation \(\hat {\rho }\) that satisfies the complexity bound *n*.

For all *ρ*_{i} ∈ *F*_{n}, we now semi-decide [[*ρ*]] ≠ [[*ρ*_{i}]]. In order to do this, we enumerate all *w* ∈ *Σ*^{∗}. In each step, if [[*ρ*]](*w*) ≠ [[*ρ*_{i}]](*w*) holds, we mark *ρ*_{i} as not equivalent to *ρ*.

If all spanners in *F*_{n} are marked, we know that no regular spanner [[*ρ*_{R}]] with [[*ρ*_{R}]] = [[*ρ*]] exists, and put out True. As *F*_{n} is finite, this point is reached in a finite number of steps if there is no such spanner. On the other hand, if such a spanner exists, the procedure will never terminate. Hence, we have defined a semi-decision procedure for NR, which implies that CSp−Regularity is co-semi-decidable, a contradiction to Theorem 4.6. □
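The marking procedure from this proof can be sketched as follows. Spanners are again modelled as Python functions from words to sets of tuples, and the finite list `candidates` stands in for the enumerated elements of *F*_{n}; these modelling choices, the function names, and the `max_words` cut-off (which only keeps the example finite) are assumptions made for this illustration.

```python
from itertools import count, product

def words(alphabet):
    """Enumerate all words over the alphabet in length-lexicographic order."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def refute_all(target, candidates, alphabet, max_words=None):
    """Return True once every candidate has been marked as not equivalent.

    Without max_words, the loop never terminates if some candidate is
    equivalent to the target, as in the semi-decision procedure above.
    """
    unmarked = set(range(len(candidates)))
    for idx, w in enumerate(words(alphabet)):
        if not unmarked:
            return True
        if max_words is not None and idx >= max_words:
            return None
        result = target(w)
        # Mark (drop) every candidate that disagrees with the target on w.
        unmarked = {i for i in unmarked if candidates[i](w) == result}
```

Processing all candidates on the same word in each step is one simple way to interleave the finitely many semi-decision procedures.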

Hence, the blowup from \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) to RGX^{{π, ∪, ⋈}} is not bounded by any recursive function. As above, we can replace each of these classes with a class with the same expressive power; for example, we can replace RGX^{{π, ∪, ⋈}} with VA_{stk}^{{π, ∪, ⋈}}, VA_{set}, or VA_{set}^{{π, ∪, ⋈}} (or, as the proof uses Boolean spanners, RGX or VA_{stk}, or any class between those).

We also consider the relative succinctness of representations of core spanners and representations of their complements. For every spanner *P*, we define its *complement* \(\textsf{C}(P):= {\Upsilon }_{\textsf{SVars}(P)}\setminus P\), and its *hierarchical complement* \(\textsf{C}^H(P):= {\Upsilon }^H_{\textsf{SVars}(P)}\setminus P\).

**Theorem 4.11**

*Let c be a complexity measure for* \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\). *For every recursive function* \(f\colon \mathbb {N}\to \mathbb {N}\), *there exists a* \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) *such that*

1. \(\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\in [{\kern -2.3pt}[ \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}} ]{\kern -2.3pt}]\), *but*
2. \(c(\hat {\rho })>f(c(\rho ))\) *holds for every* \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) *with* \([{\kern -2.3pt}[\hat {\rho }]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[\rho ]{\kern -2.3pt}])\).

*This also holds if we consider* C^{H} *instead of* C.

### Proof

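We prove the claim only for C (as the proof uses only Boolean spanners, we do not need to consider C^{H} separately). For convenience, we define the set of all Boolean core spanner representations

\[\textsf{BCSR} := \bigl \{\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}} \mid \textsf{SVars}(\rho )=\emptyset \bigr \},\]

as well as its subsets

\[\textsf{FIN} := \{\rho \in \textsf{BCSR}\mid \mathcal {L}(\rho )\text { is finite}\}\quad \text {and}\quad \textsf{COF} := \{\rho \in \textsf{BCSR}\mid \mathcal {L}(\rho )\text { is co-finite}\}.\]

We use the lower levels of the arithmetical hierarchy: \({\Sigma ^{0}_{1}}\) is the family of all semi-decidable sets, and \({\Pi ^{0}_{1}}\) is the family of all co-semi-decidable sets. Regarding the next level, \({\Sigma ^{0}_{2}}\) is the family of all sets that are semi-decidable when using oracles for sets in \({\Sigma ^{0}_{1}}\) (or in \({\Pi ^{0}_{1}}\)), and \({\Pi ^{0}_{2}}\) is the family of all sets that are co-semi-decidable when using such oracles. Finally, \({\Delta ^{0}_{2}}={\Sigma ^{0}_{2}}\cap {\Pi ^{0}_{2}}\) is the family of all sets that are decidable when using oracles for sets in \({\Sigma ^{0}_{1}}\) or in \({\Pi ^{0}_{1}}\).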

A central part of our reasoning in this proof is the following observation:

*Claim 1*

\({\textsf{COF}\not \in \Delta ^{0}_{2}}\).

### Proof

As shown in Freydenberger [12], the xregexes that we used in the proof of Theorem 4.6 also prove that co-finiteness for vstar-free xregexes is \({\Sigma ^{0}_{2}}\)-complete.

Hence, the proof of Theorem 4.6 also implies that COF is \({\Sigma ^{0}_{2}}\)-hard. This immediately implies \({\textsf{COF}\notin \Delta ^{0}_{2}}\); as otherwise, \({\Sigma ^{0}_{2}}={\Delta ^{0}_{2}}\) would hold, which contradicts the fact that the arithmetical hierarchy is a proper hierarchy. \(\square \) (Claim 1)

Our goal is to use Claim 1 to obtain the contradiction on which this proof rests. More precisely, we shall prove that any recursive bound on the size of the core spanner for a complement can be used to prove \({\textsf{COF}\in \Delta ^{0}_{2}}\). One of the central parts of our reasoning shall be the following result.

*Claim 2*

\({\textsf{FIN}\in \Sigma ^{0}_{1}}\).

### Proof

We describe a semi-decision procedure for FIN. Let *ρ* ∈ BCSR. Enumerate all finite sets *S* ⊂ *Σ*^{∗}. For each set, we check the following two conditions:

1. \(S\subseteq \mathcal {L}(\rho )\)
2. \(\mathcal {L}(\rho )\cap (\Sigma ^{*}\setminus S)=\emptyset \)

As *S* is finite, the first condition can be checked by deciding if \(w\in \mathcal {L}(\rho )\) for each *w* ∈ *S*.

For the second condition, we first construct a regular expression *α* with \(\mathcal {L}(\alpha )= (\Sigma ^{*}\setminus S)\). Then, we define the Boolean core spanner representation *ρ*_{S} := *α* ∩ *ρ*. As \(\mathcal {L}(\rho _{S})=\mathcal {L}(\alpha )\cap \mathcal {L}(\rho )=(\Sigma ^{*}\setminus S)\cap \mathcal {L}(\rho )\), we can decide the second condition by checking if \(\mathcal {L}(\rho _{S})=\emptyset \) (which is decidable, according to Theorem 4.4).

If *S* satisfies both conditions, \(S=\mathcal {L}(\rho )\) holds. Hence, \(\mathcal {L}(\rho )\) is finite, and the semi-decision procedure returns True. Furthermore, for every *ρ* ∈ FIN, the procedure will (after a finite number of enumerated finite sets) check the set \(S=\mathcal {L}(\rho )\), and then return True. Thus, FIN is semi-decidable, which is equivalent to \({\textsf{FIN}\in \Sigma ^{0}_{1}}\). \(\square \) (Claim 2)
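The enumeration of finite sets in this proof can be sketched as follows. The two checks are passed in as callbacks: `member` decides *w* ∈ \(\mathcal {L}(\rho )\), and `empty_outside` stands in for the emptiness test of \(\mathcal {L}(\rho )\cap (\Sigma ^{*}\setminus S)\), which the proof obtains from the decidability of satisfiability. Both callbacks, the bit-vector encoding of finite sets, and the `max_sets` cut-off are assumptions made for this illustration.

```python
from itertools import count, islice, product

def words(alphabet):
    """Enumerate all words over the alphabet in length-lexicographic order."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def finite_sets(alphabet):
    """Enumerate every finite set of words: bit i of `code` selects the i-th word."""
    for code in count(0):
        prefix = islice(words(alphabet), code.bit_length())
        yield frozenset(w for i, w in enumerate(prefix) if (code >> i) & 1)

def semi_decide_finite(member, empty_outside, alphabet, max_sets=None):
    """Return a finite set S equal to the language, if the search finds one.

    Without max_sets, the search runs forever on infinite languages.
    """
    for idx, S in enumerate(finite_sets(alphabet)):
        if max_sets is not None and idx >= max_sets:
            return None
        if all(member(w) for w in S) and empty_outside(S):
            return S
```

Every finite set of words corresponds to exactly one value of `code`, so each candidate *S* is reached after finitely many steps.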

The next observation is not very deep; but in order to streamline the flow of our later reasoning, we state it as a separate claim.

*Claim 3*

For every *ρ* ∈ BCSR, we have that *ρ* ∈ COF holds if and only if there is a \(\hat {\rho }\in \textsf{FIN}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\).

### Proof

Let *ρ* ∈ BCSR. We begin with the *if*-direction. Assume there exists a \(\hat {\rho }\in \textsf{FIN}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\). As \(\hat {\rho }\in \textsf{FIN}\), the language \(\mathcal {L}(\hat {\rho })\) is finite, which implies that \(\mathcal {L}(\rho )=\Sigma ^{*}\setminus \mathcal {L}(\hat {\rho })\) is co-finite. Hence, *ρ* ∈ COF.

For the *only-if* direction, let *ρ* ∈ COF; i. e., \(\mathcal {L}(\rho )\) is co-finite. Hence, \(\Sigma ^{*}\setminus \mathcal {L}(\rho )\) is finite, and regular. Thus, there exists a proper regular expression \(\hat {\rho }\) with \(\mathcal {L}(\hat {\rho })=\Sigma ^{*}\setminus \mathcal {L}(\rho )\). As every proper regular expression is also a functional regex formula with no variables (and, hence, Boolean), \(\hat {\rho }\in \textsf{BCSR}\) follows. This gives \(\hat {\rho }\in \textsf{FIN}\), while \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\) holds by our choice of \(\hat {\rho }\). \(\square \) (Claim 3)

We now proceed to the main part of the proof, which uses these claims. Let *c* be a complexity measure for the class \(\textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\). Assume that there exists a recursive function \(f\colon \mathbb {N}\to \mathbb {N}\) such that for all \(\rho \in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) for which C([[*ρ*]]) is a core spanner, there exists a \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\) and \(c(\hat {\rho })\leq f(c(\rho ))\).

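We can use *f* to decide COF for any *ρ* ∈ BCSR as follows. First, compute *n* := *f*(*c*(*ρ*)), and let *R*_{n} be the set of all \(\hat {\rho }\in \textsf{RGX}^{\{\pi ,\zeta ^=,\cup ,\bowtie \}}\) with \(c(\hat {\rho })\leq n\). By Claim 3, *ρ* ∈ COF if and only if there is a \(\hat {\rho }\in \textsf{FIN}\) with \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\). Due to our assumption on *f*, this holds if and only if such a \(\hat {\rho }\) exists in *R*_{n}. Hence, for each \(\hat {\rho }\in R_{n}\), the procedure checks the following two criteria:

1. \(\hat {\rho }\in \textsf{FIN}\)
2. \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]=\textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\)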

Regarding the second criterion, note that \([{\kern -2.3pt}[ \hat {\rho } ]{\kern -2.3pt}]\neq \textsf{C}([{\kern -2.3pt}[ \rho ]{\kern -2.3pt}])\) is semi-decidable (as it suffices to find a *w* ∈ *Σ*^{∗} that disproves the equality). Hence, this criterion is co-semi-decidable, which means that it can be decided with a \({\Pi ^{0}_{1}}\)-oracle.
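The search for a disproving word can be sketched as follows. This is only an illustration under assumptions that are not part of the proof: the two Boolean spanners are modelled as membership predicates (the hypothetical callables `accepts_hat` and `accepts_rho`), and the optional `max_len` bound exists only to make the sketch terminate. Without the bound, the search halts exactly when a disproving word exists, which is the semi-decidability used above.

```python
from itertools import count, product
from typing import Callable, Iterable, Iterator, Optional


def words(alphabet: Iterable[str], max_len: Optional[int] = None) -> Iterator[str]:
    """Enumerate all words over `alphabet`, shortest words first."""
    letters = sorted(alphabet)
    lengths = count(0) if max_len is None else range(max_len + 1)
    for n in lengths:
        for letters_of_word in product(letters, repeat=n):
            yield "".join(letters_of_word)


def find_witness(
    accepts_hat: Callable[[str], bool],   # hypothetical: membership in L(rho_hat)
    accepts_rho: Callable[[str], bool],   # hypothetical: membership in L(rho)
    alphabet: Iterable[str],
    max_len: Optional[int] = None,        # assumption: bound added for termination
) -> Optional[str]:
    """Return a word showing that L(rho_hat) is not the complement of L(rho).

    For Boolean representations, rho_hat describes the complement of rho
    iff every word is accepted by exactly one of the two. A witness is
    therefore a word that both accept or both reject.
    """
    for w in words(alphabet, max_len):
        if accepts_hat(w) == accepts_rho(w):
            return w
    return None  # only reachable when max_len is given


# Toy instance: L(rho) = all words of length at least 2 (a co-finite language).
rho = lambda w: len(w) >= 2
good_candidate = lambda w: len(w) < 2     # the true complement
bad_candidate = lambda w: w == ""         # misses the words of length 1

no_witness = find_witness(good_candidate, rho, "ab", max_len=4)
witness = find_witness(bad_candidate, rho, "ab", max_len=4)
```

In the toy instance, `no_witness` is `None` (no disproving word up to the bound), while `witness` is the word `"a"`, which is rejected by both predicates.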

If there exists a \(\hat {\rho }\in R_{n}\) that satisfies both criteria, the procedure returns True. In this case, *ρ* ∈ COF holds by Claim 3; hence, this is correct.

If no such \(\hat {\rho }\) can be found among the (finitely many) elements of *R*_{n}, the procedure returns False. As mentioned above, this is correct due to our assumptions on *f*.

As COF can be decided by using oracles for \({\Sigma ^{0}_{1}}\) and \({\Pi ^{0}_{1}}\), we know that \(\textsf{COF}\in {\Delta ^{0}_{2}}\) must hold. This contradicts Claim 1. As our only assumption was the existence of the recursive bound *f*, no such bound can exist.
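The overall shape of this decision procedure can be sketched as follows. Everything in the sketch is hypothetical scaffolding: `c` and `f` are the complexity measure and the assumed recursive bound, `candidates_up_to` stands for an enumeration of the finite set *R*_{n}, and `in_fin` and `is_complement_of` are stand-ins for the oracle queries. None of these can be implemented as ordinary programs in general; the point is only that, given the oracles, the search space is finite.

```python
from typing import Callable, Iterable, TypeVar

Rep = TypeVar("Rep")  # hypothetical type of spanner representations


def decide_cof(
    rho: Rep,
    c: Callable[[Rep], int],                         # complexity measure
    f: Callable[[int], int],                         # assumed recursive bound
    candidates_up_to: Callable[[int], Iterable[Rep]],  # enumerates R_n
    in_fin: Callable[[Rep], bool],                   # oracle stand-in, criterion 1
    is_complement_of: Callable[[Rep, Rep], bool],    # oracle stand-in, criterion 2
) -> bool:
    """Decide whether rho is in COF, relative to the two oracle stand-ins."""
    n = f(c(rho))
    for rho_hat in candidates_up_to(n):
        # Both criteria must hold for the same candidate.
        if in_fin(rho_hat) and is_complement_of(rho_hat, rho):
            return True   # correct by Claim 3
    return False          # correct by the assumption on f


# Stub instance that only exercises the control flow.
found = decide_cof(
    "rho",
    c=lambda r: 3,
    f=lambda k: k + 1,
    candidates_up_to=lambda k: [f"cand{i}" for i in range(k)],
    in_fin=lambda r: r != "cand0",
    is_complement_of=lambda r_hat, r: r_hat == "cand3",
)
not_found = decide_cof(
    "rho",
    c=lambda r: 3,
    f=lambda k: k + 1,
    candidates_up_to=lambda k: [f"cand{i}" for i in range(k)],
    in_fin=lambda r: True,
    is_complement_of=lambda r_hat, r: False,
)
```

The stubs carry no meaning beyond showing that the loop returns True as soon as one candidate passes both checks and False once the finite candidate set is exhausted.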

In other words, there are core spanners where the (hierarchical) complement is also a core spanner, but the blowup between their representations is not bounded by any recursive function. Again, this holds for the other classes of representations as well.

This result has consequences for an open question of Fagin et al. One of the central tools in [7] is the core-simplification lemma, which states that every core spanner is definable by an expression of the form *π*_{V} *S* *A*, where *A* is a vset-automaton, *V* ⊆ SVars(*A*), and *S* is a sequence of selections \(\zeta ^=_{x,y}\) for *x*, *y* ∈ SVars(*A*).
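To illustrate the shape of this normal form, the following sketch evaluates an expression of the form *π*_{V} *S* *A* on a string, under simplifying assumptions: the relation produced by the vset-automaton *A* is taken as given (as a list of tuples), and spans are represented as 0-based, half-open index pairs so that they match Python slices. The function names are hypothetical and not taken from [7].

```python
from typing import Dict, Iterable, List, Sequence, Tuple

Span = Tuple[int, int]          # assumption: 0-based, half-open, i.e. w[i:j]
SpanTuple = Dict[str, Span]     # maps each variable to a span of w


def select_equal(w: str, rel: Iterable[SpanTuple], x: str, y: str) -> List[SpanTuple]:
    """String equality selection: keep tuples whose x- and y-spans
    cover equal substrings of w (the spans themselves may differ)."""
    return [t for t in rel if w[t[x][0]:t[x][1]] == w[t[y][0]:t[y][1]]]


def project(rel: Iterable[SpanTuple], keep: Sequence[str]) -> List[SpanTuple]:
    """Projection onto the variables in `keep`, with duplicates removed."""
    seen = set()
    result = []
    for t in rel:
        reduced = {v: t[v] for v in keep}
        key = tuple(sorted(reduced.items()))
        if key not in seen:
            seen.add(key)
            result.append(reduced)
    return result


def evaluate_simplified(
    w: str,
    automaton_output: Iterable[SpanTuple],     # stands in for the result of A on w
    selections: Sequence[Tuple[str, str]],     # the sequence S of pairs (x, y)
    keep: Sequence[str],                       # the variable set V
) -> List[SpanTuple]:
    rel = list(automaton_output)
    for x, y in selections:
        rel = select_equal(w, rel, x, y)
    return project(rel, keep)


w = "abab"
automaton_output = [
    {"x": (0, 2), "y": (2, 4)},   # "ab" and "ab"
    {"x": (0, 2), "y": (1, 3)},   # "ab" and "ba"
    {"x": (0, 1), "y": (2, 3)},   # "a" and "a"
    {"x": (0, 2), "y": (0, 2)},   # same span twice
]
result = evaluate_simplified(w, automaton_output, [("x", "y")], ["x"])
```

Here the selection removes the second tuple, and the projection onto `x` merges the first and fourth tuples, so `result` contains the two tuples `{"x": (0, 2)}` and `{"x": (0, 1)}`.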

In addition to core spanners, Fagin et al. also discuss adding a set difference operator ∖, and ask “whether we can find a simple form, in the spirit of the core-simplification lemma, when adding difference to the representation of core spanners”. It is a direct consequence of Theorem 4.11 that such a simple representation, if it exists, cannot be obtained effectively, as reducing the number of difference operators can lead to a non-recursive blowup. While this observation does not prove that such a simple form does not exist, it suggests that any proof of its existence should be expected to be non-constructive.

## 5 Conclusions and Further Work

In Section 3, we have seen that core spanners can express languages that are defined by patterns or by vstar-free xregexes. We used this in Section 4 to derive various lower bounds on decision problems, even for subclasses of core spanner representations. Note that in most of the cases, these lower bounds do not require the join operator, and mostly rely on the string equality selection. This can be interpreted as a sign that string equality (or repetition) is an expensive operator, in particular as similar results have been observed for related models (e. g., [2, 12, 16]). On the other hand, Proposition 4.2 demonstrates that even without string equality, join is also an expensive operator. The authors take this as a sign that the search for good restrictions on core spanners will probably have to combine restrictions on string equality and on join.

There is also reason to hope that the connections to patterns and word equations can be beneficial for spanners: There is recent work on restricted classes of pattern languages with an efficient membership problem (e. g., [10, 33]), which could lead to subclasses of spanners that can be evaluated more efficiently. Furthermore, as Theorems 3.12 and 3.13 show, core spanners and word equations with regular constraints are closely related. Recent work on word equations has also considered tasks like enumerating all solutions of an equation. The employed compression techniques (cf. [6]) might also be used to improve the evaluation of core spanners. In particular, the EC^{reg}-formulas that are constructed in the proof of Theorem 3.12 have the special property that there is a variable *x*_{w} (for *w*), and for every solution *σ* and every variable *x*, we have that *σ*(*x*) is a subword of *σ*(*x*_{w}).
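The special property mentioned at the end of this paragraph is easy to state operationally. The following minimal sketch assumes that a solution *σ* is given as a dictionary from variable names to words, and that "subword" means a contiguous factor; the function name and the variable name `"x_w"` are hypothetical.

```python
from typing import Dict


def has_subword_property(sigma: Dict[str, str], main_var: str) -> bool:
    """Check that the image of every variable is a contiguous
    subword of the image of `main_var`."""
    whole = sigma[main_var]
    return all(image in whole for image in sigma.values())


# "ab", "ca" and the empty word all occur inside "abcab".
ok = has_subword_property({"x_w": "abcab", "x": "ab", "y": "ca", "z": ""}, "x_w")
# "ac" is not a contiguous subword of "abc".
violated = has_subword_property({"x_w": "abc", "x": "ac"}, "x_w")
```

The first call yields True and the second False.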

Freydenberger [13] builds on this observation and introduces a fragment of EC^{reg} that has exactly the same expressive power as core spanners. The connection is even stronger: As shown in [13], there exist polynomial time conversions between this fragment and core spanner representations. It remains to be seen whether the connection between spanners and word equations can also be used to find interesting subclasses of core spanners that have friendlier upper bounds (in particular regarding evaluation).

Also note that the conversion from vstar-free regular expressions to core spanner representations that is used for Theorem 3.21 can lead to an exponential increase in size. As shown in [13], this blowup can be avoided by using a more involved construction.

Finally, while we only mentioned this explicitly in Section 4.2.1, note that most of the other results in this paper can also be directly converted to the appropriate spanner representations that use vset- and vstk-automata from [7].

## Notes

### Acknowledgements

We thank Florin Manea for his suggestion to use word equations with regular constraints, and Thomas Zeume for reporting a list of typos. We also thank the anonymous reviewers of both this paper and the conference version for their feedback.

### References

1. Angluin, D.: Finding patterns common to a set of strings. J. Comput. Syst. Sci. **21**, 46–62 (1980)
2. Barceló, P., Libkin, L., Lin, A.W., Wood, P.T.: Expressive languages for path queries over graph-structured data. ACM Trans. Database Syst. **37**(4), 31 (2012)
3. Barceló, P., Muñoz, P.: Graph Logics with Rational relations: The Role of Word Combinatorics. In: Proc. CSL-LICS 2014 (2014)
4. Bremer, J., Freydenberger, D.D.: Inclusion problems for patterns with a bounded number of variables. Inform. Comput. **220–221**, 15–43 (2012)
5. Câmpeanu, C., Salomaa, K., Yu, S.: A formal study of practical regular expressions. Int. J. Found. Comput. Sci. **14**, 1007–1018 (2003)
6. Diekert, V.: Makanin’s Algorithm. In: Lothaire, M. (ed.) Algebraic Combinatorics on Words, chapter 12, pages 387–442. Cambridge University Press (2002)
7. Fagin, R., Kimelfeld, B., Reiss, F., Vansummeren, S.: Document spanners: A formal approach to information extraction. J. ACM **62**(2), 12 (2015)
8. Fagin, R., Kimelfeld, B., Reiss, F., Vansummeren, S.: Declarative cleaning of inconsistencies in information extraction. ACM Trans. Database Syst. **41**(1), 6 (2016)
9. Fernau, H., Manea, F., Mercas, R., Schmid, M.L.: Pattern Matching with variables: Fast Algorithms and New Hardness Results. In: Proc. STACS 2015 (2015)
10. Fernau, H., Schmid, M.L.: Pattern matching with variables: A multivariate complexity analysis. Inf. Comput. **242**, 287–305 (2015)
11. Fernau, H., Schmid, M.L., Villanger, Y.: On the parameterised complexity of string morphism problems. Theory Comput. Sys. (2015)
12. Freydenberger, D.D.: Extended regular expressions: Succinctness and decidability. Theory Comput. Sys. **53**(2), 159–193 (2013)
13. Freydenberger, D.D.: A Logic for Document Spanners. In: Proc. ICDT. Accepted (2017)
14. Freydenberger, D.D., Holldack, M.: Document spanners: From Expressive Power to Decision Problems. In: Proc. ICDT 2016 (2016)
15. Freydenberger, D.D., Reidenbach, D.: Bad news on decision problems for patterns. Inform. Comput. **208**(1), 83–96 (2010)
16. Freydenberger, D.D., Schweikardt, N.: Expressiveness and static analysis of extended conjunctive regular path queries. J. Comput. Syst. Sci. **79**(6), 892–909 (2013)
17. Friedl, J.E.F.: Mastering Regular Expressions. O’Reilly Media. 3rd edition (2006)
18. Garey, M.R., Johnson, D.S.: Computers and intractability. W. H. Freeman and Company (1979)
19. Ginsburg, S., Spanier, E.: Semigroups, Presburger formulas, and languages. Pac. J. Math. **16**(2), 285–296 (1966)
20. Grohe, M., Flum, J.: Parameterized complexity theory. Texts in Theoretical Computer Science. Springer (2006)
21. Hartmanis, J.: On Gödel speed-up and succinctness of language representations. Theor. Comput. Sci. **26**(3), 335–342 (1983)
22. Holzer, M., Kutrib, M.: Descriptional complexity – an introductory survey. Sci. Appl. Language Methods **2**, 1–58 (2010)
23. Ibarra, O.H., Pong, T.-C., Sohn, S.M.: A note on parsing pattern languages. Pattern Recogn. Lett. **16**(2), 179–182 (1995)
24. Jiang, T., Kinber, E., Salomaa, A., Salomaa, K., Yu, S.: Pattern languages with and without erasing. Int. J. Comput. Math. **50**, 147–163 (1994)
25. Jiang, T., Salomaa, A., Salomaa, K., Yu, S.: Decision problems for patterns. J. Comput. Syst. Sci. **50**, 53–63 (1995)
26. Karhumäki, J., Mignosi, F., Plandowski, W.: The expressibility of languages and relations by word equations. J. ACM **47**(3), 483–505 (2000)
27. Kozen, D.: Lower Bounds for Natural Proof Systems. In: Proc. FOCS 1977 (1977)
28. Kozen, D.: Theory of computation. Springer-Verlag (2006)
29. Kutrib, M.: The phenomenon of non-recursive trade-offs. Int. J. Found. Comput. Sci. **16**(5), 957–973 (2005)
30. Lothaire, M.: Combinatorics on Words. Cambridge University Press (1997)
31. Ohlebusch, E., Ukkonen, E.: On the equivalence problem for E-pattern languages. Theor. Comput. Sci. **186**, 231–248 (1997)
32. Parikh, R.J.: On context-free languages. J. ACM **13**(4), 570–581 (1966)
33. Reidenbach, D., Schmid, M.L.: Patterns with bounded treewidth. Inform. Comput. **239**, 87–99 (2014)
34. Stephan, F., Yoshinaka, R., Zeugmann, T.: On the Parameterised Complexity of Learning Patterns. In: Proc. ISCIS 2011 (2011)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.