1 Introduction

An extension of regular expressions—synchronized regular expressions (SRE)—is defined and studied in [6]. SRE make it possible to check whether certain subexpressions are repeated the same number of times in a text. This can be useful for integrity checks, especially when combined with other extensions such as backreferences (as defined in [1]). Della Penna et al. use SRE to present a formal study of the backreferences extension and of a new extension, called synchronized exponents, that they propose. They also study the classification of SRE in the formal language hierarchy and show that SRE are context-sensitive but do not generate all context-free languages. The membership problem for SRE, and for SRE with certain restrictions, is studied in [6] as well. In [5], Carle shows that the language of palindromes cannot be generated by a synchronized regular expression, whereas the language \(\{ww \mid w \in \{0, 1\}^*\}\) is a synchronized regular language. Hence, the class of synchronized regular languages is incomparable with the class of context-free languages.

In [7], Freydenberger shows that the set of invalid computations of an extended Turing machine can be recognized by an extended regular expression (introduced by Câmpeanu et al. [4]), and hence by a synchronized regular expression. Therefore, the widely discussed predicate “\(=\{0,1\}^*\)” is not recursively enumerable for SRE. Moreover, Freydenberger shows that the regularity, cofiniteness, and RegEx(k)-ity (defined in [7]) problems are not recursively enumerable for SRE. More research on extended regular expressions (EXREGs) can be found in [2, 9, 26, 27].

In this paper, we employ a stronger form of non-recursive enumerability called productiveness. A productive set S is not recursively enumerable. Furthermore, for any effective axiomatic system F, there is an effective procedure to construct an element that is in S but not provable in F (see Sect. 3.1 for precise definitions). We then show that the set of invalid computations of a deterministic Turing machine on a single input can be recognized by a synchronized regular expression. Hence, for any polynomial-time decidable subset of SRE in which each expression generates either \(\{0, 1\}^*\) or \(\{0, 1\}^* -\{w\}\) for some \(w \in \{0, 1\}^*\), the predicate “\(=\{0,1\}^*\)” is productive. This special type of universality problem is denoted by “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)”. Due to the simplicity of the construction in its proof, this result can be easily applied to other classes of language descriptors, such as 1-SRE (see Definition 6), one-reversal bounded one-counter machines, and real-time one-way cellular automata (the definition can be found in [20]). This result also implies the productiveness of many problems for SRE. These problems include:

  1.

    a variety of equivalence and containment problems, such as testing equivalence to any fixed unbounded regular language,

  2.

    the language class comparison problem, which is defined as follows:

    For two classes of language descriptors \({\mathcal {D}}_1\) and \({\mathcal {D}}_2\), determine, for any \(a \in {\mathcal {D}}_1\), whether \({\mathcal {L}}(a) \in {\mathcal {L}}({\mathcal {D}}_2)\).

The general containment problem for pattern languages over fixed alphabets is shown to be undecidable in [8]; this applies to SRE directly, since all pattern languages can be represented by SRE (see [6]). We study the problems of testing equivalence and containment to many fixed languages, since these results are stronger and have more practical significance. For example, the result on testing equivalence to a fixed regular set enables us to show that there is no approximating minimization algorithm between SRE and DFA accepting this fixed regular set.

Several authors have investigated the existence and applicability of analogues of Rice’s Theorem for many classes of languages. For example, in [17, 18], sufficient conditions are given for a language predicate to be as hard as the language predicate \(=\{0, 1\}^*\). There are five major differences between the previous results in [17, 18] and the results in this paper.

  1.

    In Theorem 4.5, we show a way to study predicates that are not true for any regular/context-free sets. This is not done in the previous research. Most of the previous results require the language predicates to be true for some regular sets.

  2.

    Due to the properties of the predicate “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)”, most of the results in this paper are applicable to promise problems. For example, Theorem 4.3 states that for a polynomial-time decidable subset of SRE, where each expression is guaranteed to generate a regular set, the predicates are productive.

  3.

    The previous results apply only to regular expressions, context-free grammars, and in some cases, context-sensitive grammars. But, because of the simple construction in Proposition 3, we can easily apply the results of this paper to any class of language descriptors \({{\mathcal {D}}}\) such that \({\mathcal {L}}({{\mathcal {D}}})\) contains the language \(\{x \# y \mid x, y \in (\Sigma - \{\#\})^*\) and \(\vert x \vert = \vert y \vert \}\) and \({\mathcal {L}}({{\mathcal {D}}})\) is closed under union and under concatenation with regular sets. For example, the results of this paper can be applied to SRE, 1-SRE, one-reversal bounded one-counter machines, and real-time one-way cellular automata.

  4.

    The previous results cannot be applied to the language class comparison problems between incomparable classes. But the results in this paper enable us to study such problems. For example, we know that the class of synchronized regular languages is incomparable with the class of context-free languages. In Corollary 4, we show that it is productive to determine, for an arbitrary synchronized regular expression, whether it generates a context-free language.

  5.

    The previous results require the language predicates to satisfy certain restrictions, such as being closed under left or right derivatives. But in Corollary 3, we show that it is productive to determine, for an arbitrary synchronized regular expression, whether it generates a k-pattern language (defined in Definition 3) for any \(k \ge 1\). Due to the dichotomization of the reduction in the proof, we do not need any closure property for k-pattern languages.

The second aim of this paper is to study the descriptional complexity of SRE. In the theory of formal languages, questions concerning descriptional complexity are widely discussed: how succinctly can a descriptor generate a language in comparison with other descriptors generating the same language? It is well known that for every natural number \(n \ge 1\), there exists a regular language accepted by a nondeterministic finite automaton (NFA) with n states such that every deterministic finite automaton (DFA) accepting the same language has at least \(2^n\) states. In [13], Hartmanis shows that there is no recursive trade-off between pushdown automata (PDA) and deterministic pushdown automata (DPDA). In [7], Freydenberger shows that there is no recursive trade-off between SRE and regular expressions. More related research can be found in [14]. In this paper, we study trade-offs between SRE and many language descriptors, including DFA, subclasses of regular expressions, and multi-patterns.
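The exponential NFA-to-DFA blow-up mentioned above can be observed concretely. The sketch below is our own illustration, not from the paper: it builds the standard witness NFA for the language "the n-th symbol from the right is 1" (with n+1 states) and counts the states produced by the subset construction.

```python
# Our own illustration (not from the paper) of the classic NFA-to-DFA
# blow-up witness: L_n = { w in {0,1}* : the n-th symbol from the right is 1 }.
# An NFA with n+1 states accepts L_n by guessing the position of that symbol;
# the subset construction below counts the DFA states it actually reaches.

def nfa_delta(state, symbol, n):
    """Successors of one NFA state: 0 is the 'waiting' state, n is accepting."""
    succ = set()
    if state == 0:
        succ.add(0)                  # keep waiting
        if symbol == '1':
            succ.add(1)              # guess: this 1 is the n-th from the right
    elif state < n:
        succ.add(state + 1)          # count the remaining symbols
    return succ

def reachable_dfa_states(n):
    """Number of subsets reachable in the determinized automaton."""
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        S = todo.pop()
        for symbol in '01':
            T = frozenset(q for s in S for q in nfa_delta(s, symbol, n))
            if T not in seen:
                seen.add(T)
                todo.append(T)
    return len(seen)

for n in range(1, 6):
    print(n, reachable_dfa_states(n))
```

The reachable subsets are exactly the sets \(\{0\} \cup S\) with \(S \subseteq \{1,\ldots,n\}\), so the determinized automaton has \(2^n\) states, matching the lower bound stated above.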

Multi-patterns (MP) and multi-pattern languages (MPL) are defined in [19] to exhibit common patterns for a given sample of words. As Della Penna et al. mentioned in [6],

backreferences are a generalization of patterns, i.e., expressions that make reference to the string matched by a previous subexpression,

we believe it is interesting to consider the relationship between SRE and MPL. Several results established in this paper are related to MPL. We prove that it is productive to determine whether a given synchronized regular expression generates a multi-pattern language. In addition, we show that there is no recursive trade-off between SRE and MP.

This paper is organized as follows.

In Sect. 2, we review the definitions of SRE and MP. Several preliminary definitions and notations are also explained.

In Sect. 3.1, the definition and importance of productiveness are discussed. In Sect. 3.2, we show the predicate “\(= \{0,1\}^* \mid _{\vert L^c\vert \le 1}\)” is productive for SRE.

In Sect. 4, sufficient conditions are given for a language predicate to be as hard as the language predicate “\(= \{0,1\}^* \mid _{\vert L^c\vert \le 1}\)” for SRE. These conditions yield a method for proving productiveness results through highly efficient many-one reductions. Using this method, we prove that many computational problems are productive for SRE.

In Sect. 5, we study the descriptional complexity of SRE and generalize a method for showing non-recursive trade-offs between SRE and many classes of language descriptors.

2 Definitions and notations

In this section, we review the definitions of SRE and MPL from [6] and [19], respectively. Several preliminary definitions and notations are also explained. The reader is referred to [16] for all unexplained notations and terminologies in language theory.

We use \(\lambda \) to denote the empty string and \(\emptyset \) to denote the empty set. We use \({\mathbb {N}}\) to denote the set of natural numbers. Let \({{\textbf {P}}}\) denote the class of sets that can be recognized in polynomial time by a deterministic Turing machine. If A is many-one reducible to B, we write \(A \leqslant _m B\).

Let REG({0,1}) be the set of \((\cup , \cdot , *)\)-regular expressions over language alphabet \(\{0,1\}\). Let CFG({0,1}) be the set of context-free grammars over terminal alphabet \(\{0,1\}\).

Definition 1

The synchronized regular expressions on an alphabet \(\Sigma \), a set of variables V and a set of exponents X are defined as follows:

  • \(\emptyset \in SRE\) (empty set)

  • \(\lambda \in SRE\) (empty string)

  • \(\forall a \in \Sigma : a\in SRE\) (letters)

  • \(\forall v \in V: v \in SRE\) (variables)

If \(e_1, e_2 \in SRE\), then:

  1.

    \(e_1^* \in SRE\) (star)

  2.

    \(\forall x \in X: e_1^x \in SRE\) (exponentiation)

  3.

    \(\forall v \in V: e_1\%v \in SRE\) (variable binding)

  4.

    \(e_1e_2 \in SRE\) (concatenation)

  5.

    \(e_1 + e_2 \in SRE\) (union)

\(\square \)

Beyond these basic syntactic definitions, a synchronized regular expression must meet the following conditions to be considered valid.

Definition 2

The SRE validity test is defined as follows:

  1.

    Each variable occurs in a binding operation no more than once in the expression.

  2.

    Each occurrence of a variable in the expression is preceded by a binding of that variable somewhere to the left of the occurrence in the expression.

Throughout this paper, let \(\mathbf {SRE(\{0,1\})}\) denote the set of valid synchronized regular expressions over alphabet \(\{0,1\}\). \(\square \)

Unless otherwise specified, any mention of SRE in this paper refers to valid SRE. The following examples are used in later proofs in this paper and can help readers better understand SRE.

Example 2.1

The synchronized regular expression \(0^x1^x\) specifies the language \(\{0^n1^n \mid n \ge 0\}\).

Example 2.2

The synchronized regular expression \((0+1)^x \# (0+1)^x\) specifies the language \(\{a\#b \mid a,b \in \{0,1\}^*\), \(\vert a \vert = \vert b \vert \}\).

Example 2.3

The synchronized regular expression \((0+1)^*\%X\cdot X\) (X is a variable) specifies the language \(\{ww \mid w \in \{0,1\}^*\}\).
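These examples can be explored with standard tools, keeping in mind that SRE are not directly supported by ordinary regex libraries. The sketch below is our own illustration, not from [6]: the variable binding of Example 2.3 corresponds to a backreference in Python's re module, while the synchronized exponent of Example 2.1 has no classic-regex counterpart and needs an explicit check.

```python
import re

# Our own illustrative sketch, not from [6]. The variable binding of
# Example 2.3, (0+1)*%X . X, corresponds to a capture group plus the
# backreference \1 in Python's re module:
ww = re.compile(r'([01]*)\1')
assert ww.fullmatch('0101') is not None    # matches with w = 01
assert ww.fullmatch('010') is None

# The synchronized exponent of Example 2.1 (0^x 1^x) cannot be expressed
# with backreferences alone; membership is checked directly instead:
def in_0n1n(s):
    n = len(s) // 2
    return len(s) % 2 == 0 and s == '0' * n + '1' * n

assert in_0n1n('000111')
assert not in_0n1n('00111')
```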

Definition 3

Let V be an alphabet of variables such that \(V \cap \{0, 1\} = \emptyset \). A pattern \(\alpha \) is a string over \(V \cup \{0, 1\}\). Let \({\mathcal {H}}\) be the set of homomorphisms h where \(h: (V \cup \{0, 1\})^* \mapsto (V \cup \{0, 1\})^*\). Then, the language generated by the pattern \(\alpha \) is defined as

\({\mathcal {L}}(\alpha ) =\{w \in \{0,1\}^* \mid w = h(\alpha )\) for some \(h \in {\mathcal {H}}\) such that \(h(0) =0\) and \(h(1)=1 \}\).

A multi-pattern \(\pi \) is a finite set of patterns, \(\pi = \{\alpha _1,\alpha _2,\alpha _3,\ldots ,\alpha _n\}\) where \(\alpha _i \in (V\cup \{0,1\})^*\) \((1 \le i \le n)\). The language generated by the multi-pattern \(\pi \) is

$$\begin{aligned} {\mathcal {L}}(\pi ) = \displaystyle \bigcup _{i=1}^n {\mathcal {L}}(\alpha _i). \end{aligned}$$

For every integer \(k \ge 1\), a k-pattern \(p = \{\alpha _1,\alpha _2,\ldots ,\alpha _k\}\) is a set of patterns of cardinality k. The language generated by the k-pattern p is

$$\begin{aligned} {\mathcal {L}}(p) = \displaystyle \bigcup _{i=1}^k {\mathcal {L}}(\alpha _i). \end{aligned}$$

Throughout this paper, MP({0,1}) denotes the set of all multi-patterns over terminal alphabet \(\{0,1\}\). \(\square \)
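Membership in a (multi-)pattern language can be decided by brute force, since \(w = h(\alpha)\) forces the image of every variable to be an infix of w. The following sketch is our own illustration, not an algorithm from [19]:

```python
from itertools import product

# Our own brute-force sketch (not an algorithm from [19]): try all
# substitutions of infixes of w for the variables of the pattern.

def in_pattern_language(pattern, w):
    """pattern is a list of symbols over V ∪ {0,1}, e.g. ['0','x','x'] for 0xx."""
    variables = sorted({s for s in pattern if s not in ('0', '1')})
    # Every h(x) must be an infix of w, so these candidates suffice.
    infixes = sorted({w[i:j] for i in range(len(w) + 1)
                             for j in range(i, len(w) + 1)})
    for values in product(infixes, repeat=len(variables)):
        h = dict(zip(variables, values))
        if ''.join(h.get(s, s) for s in pattern) == w:
            return True
    return False

def in_multipattern_language(patterns, w):
    # A multi-pattern language is the union of its pattern languages.
    return any(in_pattern_language(p, w) for p in patterns)

# Example 2.5: {0ww | w in {0,1}*} ∪ {1w | w in {0,1}*}
mp = [['0', 'x', 'x'], ['1', 'x']]
assert in_multipattern_language(mp, '00101')   # 0 · 01 · 01
assert not in_multipattern_language(mp, '001')
```

This exhaustive search runs in exponential time in general; pattern-language membership is known to be hard, so the sketch is meant only to make the definitions concrete.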

Example 2.4

The language \(\{ww \mid w \in \{0,1\}^*\}\) is a pattern language.

Proof

Consider the pattern \(\alpha =xx\) where x is a variable. Since x can be replaced by any string in \(\{0, 1\}^*\), it is clear \({\mathcal {L}}(\alpha ) =\{ww \mid w \in \{0,1\}^*\}\). \(\square \)

Example 2.5

The language \(\{0ww \mid w\in \{0,1\}^*\} \cup \{1w \mid w\in \{0,1\}^*\}\) is a multi-pattern language but not a pattern language.

Proof

It is not hard to see that no single pattern can specify this language but the multi-pattern \(\pi =\{0xx, 1x\}\) specifies this language. \(\square \)

Example 2.6

The simple regular language \(\{0\}^* \cdot \{1\}^*\) is not a multi-pattern language.

Proof

The proof can be found in [19]. \(\square \)

Let \({\mathcal {D}}\) be a class of language descriptors that describe languages over \(\Sigma \). In this paper, we only consider finite \(\Sigma \). Then, \(\forall d \in {{\mathcal {D}}}\), \({\mathcal {L}}(d)\) = {\(w \in \Sigma ^* \mid w\) is described by d} and \({\mathcal {L}}({{\mathcal {D}}})\) = {\(L \subseteq \Sigma ^* \mid \exists d\in {{\mathcal {D}}}\) such that \(L = {\mathcal {L}}(d)\)}. \(\forall d \in {{\mathcal {D}}}\), let \(\vert d \vert \) denote the size of d and \(<d>\) denote a code of d.

Our encoding of a language descriptor is efficient and is described informally below.

  1.

    For a regular expression, synchronized regular expression, or multi-pattern, the code is itself.

  2.

    For a context-free grammar with n nonterminals, nonterminals are denoted by \(s_u\) where u is a base 10 numeral without leading 0’s representing integers \(\{0, 1,\ldots ,n-1\}\). Each production \(A \rightarrow B\) is denoted by a pair (A, B).

  3.

    For a Turing machine with a set of states Q and tape alphabet T, states are denoted by \(q_u\) where u is a base 10 numeral without leading 0’s representing integers \(\{0, 1,\ldots ,\vert Q \vert -1\}\). Each machine move \(\delta (q,a)=(q',a',d)\) where \(q,q' \in Q\), \(a,a' \in T\) and \(d \in \{L,R\}\) is denoted by a 5-tuple \((q,a,q',a',d)\). For other types of automata, we have similar rules. We use tuples to denote machine moves and \(q_u\) to denote states.

The size of a DFA is the number of states of the DFA. The size of a pattern is the number of symbols of the pattern. The size of a context-free grammar is the number of symbols of all its productions. For example, the following context-free grammar d generates the language \(\{0, 1\}^*\): \(d = (\{s_1\}, \{0, 1\}, \{(s_1, 0s_1), (s_1, 1s_1), (s_1, \lambda )\}, s_1)\). The size of d is 8 (denoted by \(\vert d \vert = 8\)).
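One way to reproduce this count is sketched below; it is our own illustration of the stated convention, counting one symbol per left-hand side plus the symbols of each right-hand side, with \(\lambda \) counted as a single symbol.

```python
# Our own sketch of the size convention for the grammar d above:
# one symbol for each left-hand side plus the symbols of each
# right-hand side, counting λ as a single symbol.
productions = [
    ('s1', ('0', 's1')),   # s1 -> 0 s1   (3 symbols)
    ('s1', ('1', 's1')),   # s1 -> 1 s1   (3 symbols)
    ('s1', ('λ',)),        # s1 -> λ      (2 symbols)
]
size = sum(1 + len(rhs) for _lhs, rhs in productions)
assert size == 8           # matches |d| = 8 in the text
```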

When comparing two classes of language descriptors \({{\mathcal {D}}}_1\) and \({{\mathcal {D}}}_2\), we assume that \({\mathcal {L}}({{\mathcal {D}}}_1) \cap {\mathcal {L}}({{\mathcal {D}}}_2)\) is infinite. We say that a function \(f: {\mathbb {N}} \mapsto {\mathbb {N}}\) with \(f(n) \ge n\) is an upper bound for the trade-off between \({{\mathcal {D}}}_1\) and \({{\mathcal {D}}}_2\) (the cost of transforming a minimal descriptor in \({{\mathcal {D}}}_1\) for an arbitrary language into an equivalent minimal descriptor in \({{\mathcal {D}}}_2\)) if for all \(L \in {\mathcal {L}}({{\mathcal {D}}}_1) \cap {\mathcal {L}}({{\mathcal {D}}}_2)\) the following holds:

$$\begin{aligned} Min\{\vert d \vert \mid d \in {{\mathcal {D}}}_2, {\mathcal {L}}(d)=L\} \le f(Min\{ \vert d \vert \mid d \in {{\mathcal {D}}}_1, {\mathcal {L}}(d)=L\}). \end{aligned}$$

If no recursive function is an upper bound for the trade-off between \({{\mathcal {D}}}_1\) and \({{\mathcal {D}}}_2\), we say the trade-off between \({{\mathcal {D}}}_1\) and \({{\mathcal {D}}}_2\) is non-recursive.

3 Productiveness and the predicate \(=\{0,1\}^*\) for SRE

Section 3 consists of Sects. 3.1 and 3.2. Section 3.1 gives the definition of productiveness, a stronger form of non-recursive enumerability, and two propositions that can be used to prove productiveness results. Section 3.2 consists of Theorem 3.1, which shows that for any polynomial-time recognizable subset \(D'\) of \({\textbf {SRE}}(\{0, 1\})\) such that \(\forall d \in D'\), \({\mathcal {L}}(d) \subseteq \{0,1\}^*\) and \( \vert \{0,1\}^*-{\mathcal {L}}(d) \vert \le 1\), the set \(\{<d> \mid d \in D'\), \({\mathcal {L}}(d) =\{0, 1\}^*\}\) is productive and hence not recursively enumerable.

3.1 Productiveness

Productive sets and their properties are a standard topic in mathematical logic/recursion theory textbooks such as [25] and [28]. Productiveness is a recursion-theoretic abstraction of what causes Gödel's first incompleteness theorem to hold. Definition 4 recalls the definition of a productive set on \({\mathbb {N}}\), as developed in [25].

Definition 4

Let W be an effective Gödel numbering of the recursively enumerable sets. A set A of natural numbers is called \(productive\) if there exists a total recursive function f such that for all \(i \in {\mathbb {N}}\), if \(W_i \subseteq A\), then \(f(i) \in A-W_i\). The function f is called the \(productive function \) for A. \(\square \)

From this definition, we can see that no productive set is recursively enumerable. It is well known that the set of all provable sentences in an effective axiomatic system is always recursively enumerable. So for any effective axiomatic system F, if a set A of Gödel numbers of true sentences in F is productive, then there is at least one element in A which is true but cannot be proven in F. Moreover, there is an effective procedure to produce such an element.

Let W be an effective Gödel numbering of the recursively enumerable sets. \({{\textbf {K}}}\) denotes the set \(\{ i \in {\mathbb {N}} \mid i \in W_i \}\). \(\overline{{{\textbf {K}}}}\) denotes the set \(\{ i \in {\mathbb {N}} \mid i \not \in W_i \}\). Two well-known facts about productive sets (see [25]) that are necessary for the research developed here are as follows:

Proposition 1

  1.

    \(\overline{{{\textbf {K}}}}\) is productive.

  2.

    For all \(A \subseteq {\mathbb {N}}\), A is productive if and only if \(\overline{{{\textbf {K}}}}\le _{m} A\).

\(\square \)

Let \(\Sigma \), \(\Delta \) be two different finite alphabets such that both \(A \subseteq \Sigma ^{*}\) and \(A \subseteq \Delta ^{*}\). It is easily seen that

There exists a total recursive function \(\textit{F}: {\mathbb {N}} \rightarrow \Sigma ^{*}\) such that \(\overline{{{\textbf {K}}}}\le _{m} A\) (via F) if and only if there exists a total recursive function \(\textit{G}: {\mathbb {N}} \rightarrow \Delta ^{*}\) such that \(\overline{{{\textbf {K}}}}\le _{m} A\) (via G).

Hence, a language A is productive for some finite alphabet \(\Sigma \) such that \(A \subseteq \Sigma ^{*}\) if and only if A is productive for all finite alphabets \(\Delta \) such that \(A \subseteq \Delta ^{*}\). This is the sense in which the concept of productiveness is independent of the particular finite alphabet.

The following proposition is used to prove many productiveness results for SRE. It also shows in what sense productiveness is stronger than non-recursive enumerability: every productive set A has an infinite recursively enumerable subset, and for any sound proof procedure P, one can effectively construct an element that is in A but not provable in P.

Proposition 2

Let \(A \subseteq \Sigma ^{*}\), \(B \subseteq \Delta ^{*}\), and \(A \le _{m} B\). Then, the following holds:

  1.

    If A is productive, then so is B.

  2.

    If A is productive, then there exists a total recursive function \(\Psi :\Sigma ^{*} \rightarrow \Sigma ^{*}\), called a productive function for A, such that for all \(x \in \Sigma ^{*}\),

    \({\mathcal {L}}(M_{x}) \subseteq A \Rightarrow \Psi (x) \in A - {\mathcal {L}}(M_{x})\), where \(\{M_{x} \mid x \in \Sigma ^{*}\}\) is some Gödel numbering of Turing machines over alphabet \(\Sigma \).

  3.

    If A is productive, then A is not recursively enumerable (RE). However, A does have an infinite RE subset.

\(\square \)

Proof of 1: By Proposition 1, if A is productive, then \(\overline{{{\textbf {K}}}}\le _{m} A\). Hence by the transitivity of the many-one reducibility, \(\overline{{{\textbf {K}}}}\le _{m} B\). Hence by Proposition 1, B is also productive.

Proof of 2: Let the natural numbers be represented in unary. Let \(\overline{{{\textbf {K}}}}\le _{m} A\) (via F). Then, there exists a total recursive function \(g: \Sigma ^{*} \rightarrow \{1\}^{*} \) such that, for all \(x \in \Sigma ^*\),

\({\mathcal {L}}(M_{g(x)}) = \textit{F}^{-1}({\mathcal {L}}(M_x))\).

The proof of the existence of function g can be seen in Theorem V(a) [25] page 84. Let the function \(\Psi : \Sigma ^* \rightarrow \Sigma ^*\) be defined by, for all \(x \in \Sigma ^*\), \(\Psi (x) = \textit{F}(g(x))\). The function \(\Psi \) is a total recursive function since it is the composition of two total recursive functions with appropriate domains and ranges. The function \(\Psi \) is actually a productive function for A. This is seen as follows. Let \(x \in \Sigma ^*\); and suppose that \({\mathcal {L}}(M_x) \subseteq A\). Then, \({\mathcal {L}}(M_{g(x)}) \subseteq \textit{F}^{-1}(A) \subseteq \overline{{{\textbf {K}}}}\). By the productive property of \(\overline{{{\textbf {K}}}}\) using productive function \({{\textbf {I}}}_{\{1\}^*}\), \(g(x) \in \overline{{{\textbf {K}}}}- {\mathcal {L}}(M_{g(x)})\). Hence, \(\Psi (x) = \textit{F}(g(x)) \in A\). But \(\Psi (x) \not \in {\mathcal {L}}(M_x)\), since otherwise,

\(\Psi (x) = \textit{F}(g(x)) \Rightarrow g(x) \in \textit{F}^{-1}({\mathcal {L}}(M_x)) = {\mathcal {L}}(M_{g(x)})\),

contradicting \(g(x) \in \overline{{{\textbf {K}}}}- {\mathcal {L}}(M_{g(x)})\). Hence, as was to be verified,

\(\Psi (x) \in A - {\mathcal {L}}(M_x)\).

Proof of 3: Since \(\overline{{{\textbf {K}}}}\) is not RE and \(\overline{{{\textbf {K}}}}\le _m A\) by assumption, A is also not RE. The remainder of the proof is essentially the same as that of Theorem X in [25], pages 90–91. It is given here for the convenience of the reader.

Let \(\Psi : \Sigma ^* \rightarrow \Sigma ^*\) be a productive total recursive function for A. A total recursive function \(g: {\mathbb {N}} \rightarrow \Sigma ^*\) can be computed inductively as follows. Let \(x_0\) be some Gödel index for \(\emptyset \). Then, \(\emptyset = {\mathcal {L}}(M_{x_0}) \subseteq A\); and hence, \(\Psi (x_0) \in A - {\mathcal {L}}(M_{x_0})\). Let \(g(0) = \Psi (x_0)\). To compute \(g(n+1)\), do the following. Let \(x_{n+1}\) be a Gödel index for the finite set \(\{g(0),\ldots , g(n)\} \subseteq A\). Let \(g(n+1) = \Psi (x_{n+1})\). Then \({\mathcal {L}}(M_{x_{n+1}}) \subseteq A\); and hence, \(g(n+1) = \Psi (x_{n+1}) \in A - \{g(0),\ldots , g(n)\}\). Since the function g as defined is one-to-one, the set \(\{g(n) \mid n \ge 0\}\) is an infinite RE subset of A. \(\square \)

3.2 The predicate \(=\{0,1\}^*\) for SRE

To make our results stronger and more applicable, we first study the sets of valid and invalid computations of Turing machines. Unlike the definitions stated in [12] and [16], we define the sets of valid and invalid computations of Turing machines on given inputs. This refined definition enables us to investigate the complexity/undecidability of the restricted language predicate:

testing equivalence to \(\{0,1\}^*\) for languages whose complements’ cardinalities are less than or equal to one (denoted by “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)”).

The instances of this restricted predicate have a very important semantic property: they are the simplest regular sets. These restrictions make the predicate more widely applicable; for example, it applies directly to promise problems, to predicates on regular sets, and to the descriptional complexity of language descriptors.

Throughout this section, \(M=(Q,\Sigma , T,\delta , q_0, B, F)\) is a single tape deterministic Turing machine where:

  1.

    Q is M’s nonempty finite set of states;

  2.

    \(q_0 \in Q\) is M’s unique start state;

  3.

    \(F \subseteq Q\) is M's set of accepting states; each state in F is final;

  4.

    M’s input alphabet is \(\Sigma \) and T is M’s tape alphabet where \(\Sigma \subseteq T\);

  5.

    \(B \in T\) is the blank symbol;

  6.

    \(\delta :((Q-F) \times T) \mapsto (Q \times T \times \{L, R\}) \) is the transition function where L is the left shift and R is the right shift; and

  7.

    \(\Delta _M = T \cup (Q \times T) \cup \{ \#\}\) where the sets T, \((Q \times T)\) and \(\{\#\}\) are pairwise disjoint. \(\Delta '_M = \Delta _M- \{\#\}\).

Definition 5

Let M be any fixed deterministic Turing machine. For all \(w \in \Sigma ^+\), writing \(w=w_1w_2w_3\ldots w_k\) where \(w_j \in \Sigma \) \((1 \le j\le k)\), the set of valid computations of M on w, denoted by \(VALC_M(w)\), is the set of strings of the form \(\#id_0\#id_1\#id_2\cdots \#id_n\#\) such that

  1.

    each \(id_i\) \((1 \le i \le n)\) is an ID (instantaneous description) of M

  2.

    \(id_0 = (q_0, w_1)w_2w_3\ldots w_k\) is the initial ID of M on w

  3.

    \(id_n\) is a final ID

  4.

    \(id_i \vdash _M id_{i+1}\) for \(0 \le i < n\)

The set of invalid computations of M on w, denoted by \(INVALC_M(w)\), is the complement of \(VALC_M(w)\) with respect to \(\Delta _M^*\).

We write \(a \vdash _M bc\) where \(a,b,c \in \Delta '_M\) if and only if \(a \in (Q \times T)\) is the rightmost letter of an ID, \(\delta (a)=(q_i,b,R)\) and \(c =(q_i, B)\).

We write \(ab \vdash _M c\) where \(a,b,c \in \Delta '_M\) if and only if a is the leftmost letter of an ID and

  1.

    if \(a, b \not \in (Q \times T)\), then \(c =a\);

  2.

    if \(a \in (Q \times T)\) and \(\delta (a) =(q_i,t,L)\), then \(c=(q_i, B)\);

  3.

    if \(a \in (Q \times T)\) and \(\delta (a) =(q_i,t,R)\), then \(c=t\);

  4.

    if \(b \in (Q \times T)\) and \(\delta (b) =(q_i,t,L)\), then \(c=(q_i,a)\);

  5.

    if \(b \in (Q \times T)\) and \(\delta (b) =(q_i,t,R)\), then \(c = a\);

Or, b is the rightmost letter of an ID and

  1.

    if \(a, b \not \in (Q \times T)\), then \(c =b\);

  2.

    if \(a \in (Q \times T)\) and \(\delta (a) =(q_i,t,L)\), then \(c=b\);

  3.

    if \(a \in (Q \times T)\) and \(\delta (a) =(q_i,t,R)\), then \(c=(q_i, b)\);

  4.

    if \(b \in (Q \times T)\) and \(\delta (b) =(q_i,t,L)\), then \(c=t\).

We write \(abc \vdash _M d\) where \(a,b,c,d \in \Delta '_M\) if and only if abc is an infix of an ID and

  1.

    if \(a,b,c \not \in (Q \times T)\), then \(d =b\);

  2.

    if \(a \in (Q \times T)\) and \(\delta (a) =(q_i,t,R)\), then \(d=(q_i, b)\);

  3.

    if \(a \in (Q \times T)\) and \(\delta (a) =(q_i,t,L)\), then \(d=b\);

  4.

    if \(c \in (Q \times T)\) and \(\delta (c) =(q_i,t,L)\), then \(d=(q_i, b)\);

  5.

    if \(c \in (Q \times T)\) and \(\delta (c) =(q_i,t,R)\), then \(d=b\);

  6.

    if \(b \in (Q \times T)\) and \(\delta (b) =(q_i,t,L)\) or \(\delta (b) =(q_i,t,R)\), then \(d=t\).

\(\square \)

Intuitively, the notation \(abc \vdash _M d\) means that three consecutive letters of an ID determine one letter of the next ID. By checking every three consecutive letters of \(id_i\), if the corresponding letter of \(id_{i+1}\) is always the correct one (i.e., \(abc \vdash _M d\) holds for every three consecutive letters of \(id_i\)), we know \(id_i \vdash _M id_{i+1}\). The notations \(a \vdash _M bc\) and \(ab \vdash _M c\) are used to handle the boundary cases. For example, \(a \vdash _M bc\) means the head of the Turing machine M is scanning the rightmost letter of the ID, rewriting it to b, and moving to the right. So, \(c = (q, B)\) where q is a state of M and B is the blank symbol. Since M is a deterministic Turing machine, \(VALC_M(w)\) contains exactly one string when M accepts w; otherwise \(VALC_M(w)\) is the empty set. Hence, \(INVALC_M(w)\) is either \(\Delta _M^*\) or \(\Delta _M^* - \{t\}\) where \(t \in \Delta _M^*\). The following proposition shows that the class of synchronized regular languages contains \(INVALC_M(w)\), and that the corresponding expression can be constructed very efficiently. An intuitive explanation of the languages \(L_1\) through \(L_5\) in the following proposition can be found later in the proof.
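The window idea behind \(abc \vdash _M d\) can be sketched in code. The following toy checker is our own simplified illustration: it covers only the case where \(id_i\) and \(id_{i+1}\) have equal length and the head stays away from the borders of the ID (the relations \(a \vdash _M bc\) and \(ab \vdash _M c\) handle the border cases, which this sketch omits), and it verifies \(id_i \vdash _M id_{i+1}\) by examining three-letter windows.

```python
# Our own simplified toy checker for the window idea behind abc |-_M d.
# Head-on-letter pairs (q, t) are tuples; plain tape letters are
# one-character strings; delta maps (q, t) to (q', t', direction).

def window_successor(a, b, c, delta):
    """Letter below b in the next ID (interior, equal-length case only)."""
    if isinstance(a, tuple):                 # head just to the left of b
        q2, _t2, d = delta[a]
        return (q2, b) if d == 'R' else b
    if isinstance(c, tuple):                 # head just to the right of b
        q2, _t2, d = delta[c]
        return (q2, b) if d == 'L' else b
    if isinstance(b, tuple):                 # head on b: the letter is rewritten
        _q2, t2, _d = delta[b]
        return t2
    return b                                 # head elsewhere: letter unchanged

def check_step(id1, id2, delta):
    """Does id1 |-_M id2 hold (equal lengths, head in the interior)?"""
    if len(id1) != len(id2) or id1[0] != id2[0] or id1[-1] != id2[-1]:
        return False
    return all(window_successor(id1[j - 1], id1[j], id1[j + 1], delta) == id2[j]
               for j in range(1, len(id1) - 1))

# One move of a machine that rewrites 0 to X and moves right in state q0:
delta = {('q0', '0'): ('q0', 'X', 'R')}
id1 = ['0', ('q0', '0'), '0', '0']
id2 = ['0', 'X', ('q0', '0'), '0']
assert check_step(id1, id2, delta)
assert not check_step(id1, ['0', 'X', 'X', '0'], delta)
```

Because each output letter depends only on a window of constant width, errors in a step can be detected by a finite set of local patterns, which is exactly what the languages \(L_{5.1}\) through \(L_{5.4}\) below encode with synchronized exponents.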

Proposition 3

  1.

    \(INVALC_M(w) = L_1 \cup L_2 \cup L_3 \cup L_4 \cup L_5\), where the language \(L_j\) \((1 \le j \le 5)\) is defined as follows:

    $$\begin{aligned} L_1= & {} \Delta ^*_M- \{\#\}\cdot (T^*\cdot (Q \times T)\cdot T^*\{\#\})^+\\ L_2= & {} \Delta ^*_M-\Delta ^*_M\cdot \{\#\}\cdot T^*\cdot (F \times T)\cdot T^* \cdot \{\#\}\\ L_3= & {} \{\lambda \} \cup ((\Delta _M-\{\#\}) \cup \{\#\}\cdot ((\Delta _M-\{(q_0,w_1)\}) \cup \{(q_0,w_1)\}\\{} & {} \cdot \, ((\Delta _M-\{w_2\}) \cup \cdot \cdot \cdot \cup \{w_{k-1}\} \\{} & {} \cdot \, ((\Delta _M -\{w_k\}) \cup \{w_k\} \cdot \Delta '_M) \cdot \cdot \cdot )))\cdot \Delta ^*_M\\ L_4= & {} \Delta ^*_M \cdot \Delta '_M \cdot \{x \# y \mid x,y \in \Delta '^*_M\hbox { and } \vert x \vert = \vert y \vert \}\cdot \{\#\}\cdot \Delta ^*_M\\{} & {} \qquad \qquad \cup \\{} & {} \,\Delta ^*_M \cdot \{\#\} \cdot \{x \# y \mid x,y \in \Delta '^*_M \hbox { and } \vert x \vert = \vert y \vert \}\cdot \Delta '_M\cdot \Delta '_M \cdot \Delta ^*_M\\ L_5= & {} L_{5.1} \cup L_{5.2} \cup L_{5.3} \cup L_{5.4}\hbox { where}\\ L_{5.1}= & {} \displaystyle \bigcup _{\begin{array}{c} a,b,c\in \Delta '_M \\ a\not \in (Q \times T)\\ \hbox { or }a \not \vdash _M bc \end{array}} \Delta ^*_M\cdot \{\#\}\cdot \{ua \# vbc \mid u,v \in \Delta '^*_M\hbox { and } \vert u \vert = \vert v \vert \}\cdot \{\#\}\cdot \Delta ^*_M\\ L_{5.2}= & {} \displaystyle \bigcup _{\begin{array}{c} a,b,c \in \Delta '_M \\ ab \not \vdash _M c \end{array}} \Delta ^*_M \cdot \{ \# ab\}\cdot \Delta '^*_M \cdot \{\#c\}\cdot \Delta ^*_M\\ L_{5.3}= & {} \displaystyle \bigcup _{\begin{array}{c} a,b,c,d,e\in \Delta '_M \\ ab \not \vdash _M c\hbox { or }\\ b \vdash _M de \end{array}} \Delta ^*_M\cdot \{\#\}\cdot \{uab \# vc \mid u,v \in \Delta '^*_M\hbox { and } \vert u \vert = \vert v \vert -1\}\cdot \{\#\}\cdot \Delta ^*_M\\ L_{5.4}= & {} \displaystyle \bigcup _{\begin{array}{c} a,b,c,d\in \Delta '_M \\ abc \not \vdash _M d \end{array}} \Delta ^*_M\cdot \{\#uabcw \# vd \mid u,v,w \in \Delta '^*_M\hbox { and }\vert u \vert = \vert v \vert -1 \}\cdot \Delta ^*_M \end{aligned}$$
  2.

    The languages \(L_1\), \(L_2\) and \(L_3\) are regular sets. The languages \(L_1\) and \(L_2\) depend only on M. There exists a regular expression \(N_{M,w}\) such that \({\mathcal {L}}(N_{M,w}) = L_3\) and \(N_{M,w}\) is constructible from w deterministically in time \(O(\vert w \vert \log \vert w \vert)\).

  3.

    \(L_4\) and \(L_5\) depend only on M and can be generated by synchronized regular expressions.

  4.

    A synchronized regular expression e such that \({\mathcal {L}}(e)=INVALC_M(w)\) is constructible deterministically from w in time \(O(\vert w \vert \log \vert w \vert)\).

Proof of 1: The proof of \(L_1 \cup L_2 \cup L_3 \cup L_4 \cup L_5 \subseteq INVALC_M(w)\) is straightforward by the definition of \(INVALC_M(w)\).

No string in \(L_1\) is of the form \(\#id_0\#id_1\#id_2\cdots \#id_n\#\) where \(id_i\) \((0 \le i \le n)\) is an ID of M.

No string in \(L_2\) ends with \(id_n\#\) where \(id_n\) is a final ID of M.

No string in \(L_3\) starts with \(\#id_0\) where \(id_0\) is the initial ID of M on w.

Every string in \(L_4\) has an infix \(\# x \# y\#\) where \(x,y \in \Delta '^*_M\) such that \(\vert x \vert > \vert y \vert \) or \(\vert y \vert - \vert x \vert >1\).

Every string in \(L_5\) has an infix \(\# x \# y\#\) where \(x,y \in \Delta '^*_M\) such that \(x \not \vdash _M y\).

\(L_{5.1}\) covers the case in which \(\vert x \vert = \vert y \vert -1\) and an error causing \(x \not \vdash _M y\) appears at the rightmost part of x and y.

\(L_{5.2}\) covers the case in which an error causing \(x \not \vdash _M y\) appears at the leftmost part of x and y.

\(L_{5.3}\) covers the case in which \(\vert x \vert = \vert y \vert \) and an error causing \(x \not \vdash _M y\) appears at the rightmost part of x and y.

\(L_{5.4}\) covers the case in which an error causing \(x \not \vdash _M y\) appears in the middle of x and y.
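For intuition, the length condition enforced by \(L_4\) can be checked mechanically. The following Python sketch (a simplification not in the paper: IDs are taken as plain strings and \# as the separator) reports whether a candidate computation contains a consecutive pair of IDs whose lengths already rule out \(id_i \vdash _M id_{i+1}\):

```python
def length_violation(x: str, y: str) -> bool:
    """True iff the pair (x, y) of consecutive IDs is caught by L_4,
    i.e. |x| > |y| or |y| - |x| > 1 (an ID grows by at most one cell)."""
    return len(x) > len(y) or len(y) - len(x) > 1

def has_L4_infix(t: str) -> bool:
    """Scan a candidate computation #id_0#id_1#...#id_n# and report whether
    some consecutive pair of IDs violates the length condition."""
    ids = t.strip('#').split('#')
    return any(length_violation(x, y) for x, y in zip(ids, ids[1:]))
```

For example, `has_L4_infix('#abc#a#')` holds because the second ID is shorter than the first.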

The proof of \(INVALCM(w) \subseteq L_1 \cup L_2 \cup L_3 \cup L_4 \cup L_5\):

\(\forall t \in INVALCM(w)\), there are only two possibilities:

  1.

    t is not of the form \(\#id_{k_1}\#id_{k_2}\#id_{k_3}\cdots \#id_{k_n}\#\) where \(id_{k_i}\) \((1 \le i \le n)\) is an ID of M. Then \(t \in L_1\).

  2.

    t is of the form above. There are only two possibilities:

    (a)

      t does not end with \(id_n\#\) where \(id_n\) is a final ID of M. Then, \(t \in L_2\).

    (b)

      t ends with \(id_n\#\). Then, there are only two cases:

      (i)

        t does not start with \(\#id_0\) where \(id_0\) is the initial ID of M on w. Then, \(t \in L_3\).

      (ii)

        t starts with \(\#id_0\). Then, let \(t=\#id_0\#id_1\#id_2\cdot \cdot \cdot \#id_n\#\).

        Since \(t \in INVALCM(w)\), there exists a leftmost \(i\) \((0 \le i <n)\) such that \(id_i \not \vdash _M id_{i+1}\).

        If \(\vert id_i \vert > \vert id_{i+1} \vert \) or \(\vert id_{i+1} \vert -\vert id_i \vert >1\), then \(t \in L_4\).

        Otherwise, there exists \(x \in \Delta '_M\), the leftmost erroneous symbol in \(id_{i+1}\) causing \(id_i \not \vdash _M id_{i+1}\).

        If \(\vert id_i \vert < \vert id_{i+1} \vert \):

        (A)

          If x is the first letter of \(id_{i+1}\), then \(t \in L_{5.2}\).

        (B)

          If x is one of the last two letters of \(id_{i+1}\), then \(t \in L_{5.1}\).

        (C)

          Otherwise, \(t \in L_{5.4}\).

        If \(\vert id_i \vert = \vert id_{i+1} \vert \):

        (A)

          If x is the first letter of \(id_{i+1}\), then \(t \in L_{5.2}\).

        (B)

          If x is the last letter of \(id_{i+1}\), then \(t \in L_{5.3}\).

        (C)

          Otherwise, \(t \in L_{5.4}\).

Proof of 2, 3, and 4: From the definition of \(L_3\), we can see that the regular expression \(N_{M, w}\) accepting \(L_3\) contains \(O(\vert w \vert )\) parentheses. Hence, we need \(O(\vert w \vert \log \vert w \vert )\) time to encode and count these parentheses. From Definition 1, SRE languages are closed under union and concatenation efficiently. It is obvious that every regular language is an SRE language. From Example 2.2, it is easy to see that we can efficiently construct an SRE to specify the language \(L_1 \cup L_2 \cup L_3 \cup L_4\). Now, we give a synchronized regular expression to specify a language that is central to the specification of \(L_5\). For simplicity, assume the input alphabet of the Turing machine is \(\Sigma =\{0, 1\}\). For any \(a,b,c,d \in \{0,1\}\), the SRE \((0+1)^xabc(0+1)^*\#(0+1)^x(0+1)d\) specifies the language \(\{uabcw\#vd \mid u,w,v \in \{0,1\}^*, \vert u \vert = \vert v \vert -1\}\) over the alphabet \(\{0, 1, \#\}\). From this synchronized regular expression, we can construct a synchronized regular expression to specify \(L_5\) by concatenation with regular sets and union with SRE languages. Hence, we can efficiently construct a synchronized regular expression to specify \(L_1 \cup L_2 \cup L_3 \cup L_4 \cup L_5\). \(\square \)
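The synchronization in the SRE above can be made concrete with a brute-force membership test. The following Python sketch (an illustration of the language, not the paper's construction; the symbols a, b, c, d are passed in from \(\{0,1\}\)) decides membership in \(\{uabcw\#vd \mid u,w,v \in \{0,1\}^*, \vert u \vert = \vert v \vert -1\}\) by reading the exponent x off the right-hand side of the unique \#:

```python
def in_L(s: str, a: str, b: str, c: str, d: str) -> bool:
    """Membership in {u·abc·w·#·v·d : u,w,v in {0,1}*, |u| = |v| - 1},
    the language of the SRE (0+1)^x abc (0+1)* # (0+1)^x (0+1) d."""
    if s.count('#') != 1:
        return False
    left, right = s.split('#')
    # The right part must be v·d with v in {0,1}* and |v| = x + 1.
    if len(right) < 2 or right[-1] != d or not set(right[:-1]) <= set('01'):
        return False
    x = len(right) - 2                       # so |u| must equal x
    if len(left) < x + 3 or left[x:x + 3] != a + b + c:
        return False
    u, w = left[:x], left[x + 3:]
    return set(u) <= set('01') and set(w) <= set('01')
```

Note that the exponent x is forced by the length of the suffix after \#, which is what the synchronized exponent expresses declaratively.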

By this proposition and the following theorem, we show that even for a polynomial-time recognizable subset \(D'\) of \(\mathbf {SRE(\{0,1\})}\) where each element in \(D'\) generates either \(\{0,1\}^*\) or \(\{0,1\}^*-\{w\}\) \((w \in \{0,1\}^*)\), the predicate “\(= \{0,1\}^*\)” is already productive. This means the predicate “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)” is not recursively enumerable for \(\mathbf {SRE(\{0,1\})}\), independent of the complexity of testing whether an instance is in \(D'\). Results of this type occur throughout this paper and have many applications, especially for promise problems. Moreover, since synchronized regular expressions are recursive language descriptors, the predicate “\( \not = \{0,1\}^*\)” is recursively enumerable.

It is worth noticing that the languages \(L_1\) through \(L_5\) in Proposition 3 are very simple languages. So we can easily apply the results of this paper to any class of language descriptors \({{\mathcal {D}}}\) such that \({\mathcal {L}}({{\mathcal {D}}})\) contains the language \(\{x \# y \mid x, y \in (\Sigma - \{\#\})^*\hbox { and } \vert x \vert = \vert y \vert \}\) and \({\mathcal {L}}({{\mathcal {D}}})\) is closed under union and under concatenation with regular sets. For example, the results of this paper can be applied to one-reversal bounded one-counter machines and real-time one-way cellular automata (defined in [20]).

Theorem 3.1

There exists a subset \(D'\) of \(\mathbf {SRE(\{0,1\})}\) such that

  1.

    \(D' \in \) P;

  2.

    \(\forall d \in D'\), \({\mathcal {L}}(d) \subseteq \{0,1\}^*\) and \(\vert \{0,1\}^*-{\mathcal {L}}(d) \vert \le 1\); and

  3.

    \(\overline{{{\textbf {K}}}}\le _m \{<d> \mid d \in D'\), \({\mathcal {L}}(d) =\{0, 1\}^*\}\)

\(\square \)

Proof of 2, 3: It is not hard to see that we can efficiently code INVALCM(w) into the alphabet \(\{0, 1\}\). According to Proposition 3, a synchronized regular expression e is constructible deterministically in time \(O(\vert w \vert \log \vert w \vert )\) to accept the coded INVALCM(w). Let \(D'\) be the set of all possible e. Since M is a deterministic Turing machine, we know \(\vert {\mathcal {L}}(e)^c \vert \le 1\), and \({\mathcal {L}}(e) =\{0,1\}^*\) if and only if M does not accept w.
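The coding step can be done with any fixed-length injective homomorphism from \(\Delta _M\) into \(\{0,1\}^k\). Below is a minimal Python sketch of such a block code (the toy alphabet is illustrative only; additional care, such as adding all non-codewords to the coded language, is needed to keep the coded complement of size at most 1):

```python
from math import ceil, log2

def block_code(alphabet):
    """Fixed-length injective homomorphism: the i-th symbol maps to the
    k-bit binary representation of i, with k = ceil(log2(|alphabet|))."""
    k = max(1, ceil(log2(len(alphabet))))
    return {sym: format(i, '0%db' % k) for i, sym in enumerate(alphabet)}

def encode(code, symbols):
    """Apply the homomorphism symbol by symbol to a sequence of symbols."""
    return ''.join(code[sym] for sym in symbols)
```

Because the code is fixed-length, it is prefix-free, so encoding is injective on strings and removes exactly one coded string from the image when one string is removed from the source language.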

Proof of 1: Let e be constructed in a fixed way to accept \(L_1,\ldots ,L_5\) so that e has a special format. For example, e must contain 5 easily separable sub-expressions such that the first sub-expression accepts \(L_3\) and the remaining sub-expressions accept \(L_1, L_2, L_4, L_5\) in this exact order. Since \(L_1, L_2, L_4\), and \(L_5\) depend only on the fixed Turing machine M, one can determine the input w of M from e in time polynomial in \(\vert e \vert \) by reading the first sub-expression of e. So for any synchronized regular expression \(e_d\), if one cannot determine an input w from \(e_d\), then \(e_d \not \in D'\). Otherwise, one can determine w according to the format of \(e_d\) and construct a synchronized regular expression \(e_w\) from w and M (M is fixed) in polynomial time so that it accepts the coded INVALCM(w), ensuring that \(e_w\) contains 5 easily separable sub-expressions in the format described above. Then \(e_d \in D'\) if and only if \(e_d =e_w\). This shows that \(D' \in {\textbf{P}}\). \(\square \)

Della Penna et al. also introduced a proper subclass of SRE, namely the 1-level or “flat” SRE, in [6]. 1-SRE form a much less complex yet still useful subclass of SRE. Definition 6 reviews the definition of a 1-level synchronized regular expression. From the proofs of Proposition 3 and Theorem 3.1, it is not hard to see that Corollary 1 holds.

Definition 6

[6] 1-level synchronized regular expressions (1-SRE) are SRE where variables and exponents cannot be nested (i.e., variables and exponents cannot appear inside an exponentiated expression or in the expression that is bound to a variable). \(\square \)

Corollary 1

The predicate “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)” is productive for 1-SRE. \(\square \)

Proof

In Example 2.2 and the proof of Proposition 3, the synchronized regular expressions we present are 1-SRE. The concatenation of a 1-SRE language with a regular set, and the union of a 1-SRE language with a regular set, are still 1-SRE languages. The union of two 1-SRE languages is also a 1-SRE language. So we can construct a 1-level synchronized regular expression to accept \(L_1 \cup L_2 \cup L_3 \cup L_4 \cup L_5\) as defined in Proposition 3. \(\square \)

Therefore, all the results for SRE in this paper hold for 1-SRE.

4 Language predicates for SRE

In this section, we show that many important language predicates are as hard as the predicate “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)” for SRE. This section consists of three major theorems. Theorem 4.1 shows the productiveness of testing equivalence and containment to any fixed unbounded regular set for SRE. Theorem 4.3 gives widely applicable sufficient conditions for proving productiveness results for SRE. One condition of Theorem 4.3 is that the language predicates need to be true for only one regular set \(\{0, 1\}^*\). Theorem 4.5 shows how to prove productiveness results for predicates that are not true for any regular/context-free languages by giving two interesting examples related to multi-pattern languages.

The following definition from [15] is necessary for Theorem 4.1.

Definition 7

A regular set \(R_0 \subseteq \{0,1\}^*\) is unbounded if and only if there exist strings \(r,s,x,y \in \{0,1\}^*\) such that \(R_0 \supseteq \{r\}\cdot \{0x,1y\}^*\cdot \{s\}\). \(\square \)
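Membership in the witness set \(\{r\}\cdot \{0x,1y\}^*\cdot \{s\}\) of Definition 7 is easy to decide, because each middle block is identified by its leading selector bit, so the parse is deterministic. A minimal Python sketch (the function name and signature are ours, not from the paper):

```python
def in_witness(t: str, r: str, x: str, y: str, s: str) -> bool:
    """Deterministic membership test for {r}·{0x,1y}*·{s} (Definition 7).
    Since |r| and |s| are fixed, the split points are forced; each middle
    block is '0'+x or '1'+y according to its first bit."""
    if len(t) < len(r) + len(s) or not t.startswith(r) or not t.endswith(s):
        return False
    mid = t[len(r):len(t) - len(s)]
    while mid:
        block = '0' + x if mid[0] == '0' else '1' + y
        if not mid.startswith(block):
            return False
        mid = mid[len(block):]
    return True
```

With r = s = λ, x = 1, y = 0, for instance, this recognizes \(\{01,10\}^*\).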

Theorem 4.1

Let \(R_0\) be any fixed unbounded regular set over \(\{0,1\}\). There exists a subset \(S'\) of \(\mathbf {SRE(\{0,1\})}\) such that

  1.

    \(S' \in \) P;

  2.

    \(\forall d \in S'\), \(\vert R_0-{\mathcal {L}}(d) \vert \le 1\);

  3.

    \(\overline{{{\textbf {K}}}}\le _m \{<d> \mid d \in S'\), \({\mathcal {L}}(d) = R_0\}\); and

  4.

    \(\overline{{{\textbf {K}}}}\le _m \{<d> \mid d \in S'\), \({\mathcal {L}}(d) \supseteq R_0\}\).

\(\square \)

Proof

The proof is similar to that used in [17] to show that the predicate “\(=L_0\)” is undecidable for context-free grammars, where \(L_0\) is any fixed context-free language with an unbounded regular subset. Since \(R_0\) is unbounded, from Definition 7, there exist \(r,s,x,y \in \{0,1\}^*\) such that \(\{r\}\cdot \{0x,1y\}^* \cdot \{s\} \subseteq R_0\). \(\forall e_1 \in D'\) where \(D'\) is defined in Theorem 3.1, we can efficiently construct a synchronized regular expression \(e_2\) such that

$$\begin{aligned} {\mathcal {L}}(e_2)= & {} \{r\}\cdot h({\mathcal {L}}(e_1))\cdot \{s\}\\{} & {} \qquad \qquad \cup \\{} & {} R_0 \cap \overline{\{r\}\cdot \{0x,1y\}^*\cdot \{s\}} \end{aligned}$$

where \(h: \{0,1\}^* \mapsto \{0,1\}^*\) is the homomorphism defined by \(h(0) = 0x\) and \(h(1) = 1y\). For any \(e_1 \in D'\), we can construct \(e_2\) in time polynomial in \(\vert e_1 \vert \) since \(R_0,x,y,s\) and r are fixed constants. Let \(S'\) be the set of all such \(e_2\). \(D' \in {\textbf{P}} \Rightarrow S' \in {\textbf{P}}\). If \({\mathcal {L}}(e_1) =\{0,1\}^*\), then \({\mathcal {L}}(e_2) =R_0\); otherwise, \({\mathcal {L}}(e_1) =\{0,1\}^* -\{w\}\) for some w, and hence \({\mathcal {L}}(e_2) = R_0 - \{rh(w)s\}\). \(\square \)
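The homomorphism h is straightforward to implement, and it is injective (the selector bit at the start of each image block recovers the preimage symbol), which is why exactly one string, \(rh(w)s\), is removed from \(R_0\) in the second case. A Python sketch:

```python
def h(w: str, x: str, y: str) -> str:
    """The homomorphism of the proof: h(0) = 0x and h(1) = 1y, extended to
    strings symbol by symbol.  Injective, since the leading bit of each
    image block identifies the original symbol."""
    return ''.join('0' + x if bit == '0' else '1' + y for bit in w)
```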

Theorem 4.1 shows that for any fixed unbounded regular set \(R_0\), the predicates “\(=R_0\)” and “\( \supseteq R_0\)” are productive even for a polynomial-time recognizable subset of \(\mathbf {SRE(\{0,1\})}\) where each element generates either \(R_0\) or \(R_0 - \{w\}\) (\(w \in \{0, 1\}^*\)). We believe this result is of significant practical interest, since in practice one often settles for an approximation that differs from \(R_0\) by a finite set.

The proof of Theorem 4.1 can be easily applied to any fixed language \(L_0\) with an unbounded regular subset as long as the language \(L_0 \cap \overline{\{r\}\cdot \{0x,1y\}^*\cdot \{s\}}\) can be generated by a synchronized regular expression. Here, we give an example to show that Theorem 4.1 works for many non-regular languages. Extended regular expressions (EXREGs) were introduced by Câmpeanu et al. [4] and are closed under intersection with regular sets [3]. It is easy to see that SRE contain EXREGs effectively, since variable bindings can function as backreferences (defined in [4]). Hence, for any language \(L_0\) generated by an extended regular expression, the language \(L_0 \cap \overline{\{r\}\cdot \{0x,1y\}^*\cdot \{s\}}\) can be generated by a synchronized regular expression. This yields the following corollary.

Corollary 2

Let \(L_0\) be any extended regular language (defined in [4]) over {0, 1} with an unbounded regular subset. The predicates “=\(L_0\)” and “\(\supseteq L_0\)” are productive for \(\mathbf {SRE(\{0,1\})}\). \(\square \)

It may be practically more relevant to ask for an approximating minimization algorithm between SRE and DFA/CFG/EXREGs, i.e., given a synchronized regular expression e, to find a DFA/CFG/EXREG accepting \({\mathcal {L}}(e)\) whose size is bounded by f(M), where \(f: {\mathbb {N}} \rightarrow {\mathbb {N}}\) is a recursive function and M is the size of a minimal DFA/CFG/EXREG accepting \({\mathcal {L}}(e)\). The results on testing equivalence and containment to some fixed language \(L_0\) also enable us to show that there is no approximating minimization algorithm between SRE and DFA/CFG/EXREGs accepting \(L_0\). For simplicity, we only show the following theorem for the case \(L_0 = \{0, 1\}^*\).

Theorem 4.2

Let \(f: {\mathbb {N}} \rightarrow {\mathbb {N}}\) be a recursive function. Then, there is no algorithm for solving the f-bounded approximating minimization problem between SRE and DFA/CFG/EXREGs. \(\square \)

Proof

We only prove there is no algorithm for solving the approximating minimization problem between SRE and CFG. For DFA, the proof is easier since the universality problem is decidable for DFA. For EXREGs, the proof is similar. Assume there is an algorithm for solving the approximating minimization problem between SRE and CFG. Then, for any synchronized regular expression generating a context-free language \(L_0\), one can find an equivalent CFG of size K accepting \(L_0\) such that \(K \le f(M)\), where M is the size of a minimal CFG accepting \(L_0\). In this case, let \(L_0 = \{0, 1\}^*\). We know the size of a minimal CFG accepting \(\{0, 1\}^*\) is 8 (see the example in Sect. 2). Hence, for any synchronized regular expression e, we can run this algorithm and find an equivalent CFG d. For any context-free grammar with n nonterminals, the nonterminals are denoted by \(s_u\) where u is a base-10 numeral without leading 0’s representing an integer in \(\{0, 1,\ldots ,n-1\}\). If \(\vert d \vert > f(8)\), then \({\mathcal {L}}(e) \not = \{0, 1\}^*\). Otherwise, \(\vert d \vert \le f(8)\). There exists a finite set T such that for any context-free grammar p with \(\vert p \vert \le f(8)\), \(p \in T\). Hence, there exists a finite table telling whether \({\mathcal {L}}(p) = \{0, 1\}^*\) for all \(p \in T\). Since the table is finite, it is decidable to check which \(p = d\) and whether \({\mathcal {L}}(p) = \{0, 1\}^*\). If \({\mathcal {L}}(d) = \{0, 1\}^*\), then \({\mathcal {L}}(e) = \{0, 1\}^*\); otherwise, \({\mathcal {L}}(e) \not = \{0, 1\}^*\). This shows the universality problem is decidable for SRE, which is a contradiction. \(\square \)

To better describe Theorem 4.3, we introduce the following notations. For any predicate \(\Pi \) on a class of languages over \(\Sigma \), let \(\Pi _{left}\) denote the set \(\{L \subseteq \Sigma ^* \mid \exists L'\) where \(\Pi (L') = true\) and \(\exists a \in \Sigma ^*\), such that \(L=a{\setminus } L' \}\). Let \(\Pi _{right}\) denote the set \(\{L \subseteq \Sigma ^* \mid \exists L'\) where \(\Pi (L') = true\) and \(\exists a \in \Sigma ^*\), such that \(L= L'/a \}\). The notations \(a {\setminus } L\) and L/a denote left and right quotients with a single letter, respectively.

Theorem 4.3

Let \(\Pi \) be any non-trivial predicate on the regular sets, such that

  1.

    \(\Pi (\{0,1\}^*) = true\) and

  2.

    \({\mathcal {L}}(\mathbf {REG(\{0,1\})}) -\Pi _{left} \not = \emptyset \) or \({\mathcal {L}}(\mathbf {REG(\{0,1\})}) -\Pi _{right} \not = \emptyset \)

Then, there exists a subset \(S'\) of \(\mathbf {SRE(\{0,1\})}\) such that

  1.

    \(S' \in \) P;

  2.

    \(\forall d \in S'\), \({\mathcal {L}}(d)\) is regular; and

  3.

    \(\overline{{{\textbf {K}}}}\le _m \{<d> \mid d \in S'\), \(\Pi ({\mathcal {L}}(d)) = true\}\), hence, testing if the predicate \(\Pi \) is true for SRE is productive.

\(\square \)

Proof

The proof is similar to that used in [18], which shows the undecidability of many predicates on context-free languages that are true for \(\{0,1\}^*\). We only prove the theorem for the case \({\mathcal {L}}(\mathbf {REG(\{0,1\})}) -\Pi _{left} \not = \emptyset \); the other part of the proof is very similar. Since \({\mathcal {L}}(\mathbf {REG(\{0,1\})}) -\Pi _{left} \not = \emptyset \), there exists a regular language \(L_f\) such that \(L_f \notin \Pi _{left}\). Then, \(\forall e_1 \in D'\) where \(D'\) is defined in Theorem 3.1, we can efficiently construct a synchronized regular expression \(e_2\) such that

$$\begin{aligned} {\mathcal {L}}(e_2)= & {} h({\mathcal {L}}(e_1))\cdot \{11\}\cdot \{0,1\}^*\\{} & {} \qquad \qquad \cup \\{} & {} \{00,01\}^*\cdot \{11\}\cdot L_f\\{} & {} \qquad \qquad \cup \\{} & {} \overline{\{00,01\}^*\cdot \{11\}\cdot \{0,1\}^*} \end{aligned}$$

where \(h: \{0,1\}^* \mapsto \{0,1\}^*\) is the homomorphism defined by \(h(0) = 00\) and \(h(1) = 01\). Let \(S'\) be the set of all such \(e_2\). Since \(L_f\) is a fixed regular set, we can determine \(e_1\) from \(e_2\) in time polynomial in \(\vert e_2 \vert \). \(D' \in {\textbf{P}} \Rightarrow S' \in {\textbf{P}}\). If \({\mathcal {L}}(e_1) =\{0,1\}^*\), then \({\mathcal {L}}(e_2) =\{0,1\}^*\) and hence \(\Pi ({\mathcal {L}}(e_2))=true\); otherwise, \({\mathcal {L}}(e_1) =\{0,1\}^* -\{w\}\) for some w, and hence \({\mathcal {L}}(e_2)\) is regular and \(h(w)11 \setminus {\mathcal {L}}(e_2) =L_f\). According to the definition of \(\Pi _{left}\), \(L_f \notin \Pi _{left} \Rightarrow \Pi ({\mathcal {L}}(e_2))=false\). \(\square \)
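The final step of the proof uses a left quotient. On a finite sample of a language, the left quotient is directly computable, which is handy for sanity-checking the construction; a Python sketch (illustrative only):

```python
def left_quotient(a, sample):
    """The left quotient of a finite sample by the string a: {t : a·t in
    sample}.  In the proof, quotienting L(e_2) by h(w)·11 recovers L_f."""
    return {t[len(a):] for t in sample if t.startswith(a)}
```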

Theorem 4.3 is extremely useful for proving productiveness results of language class comparison problems for SRE with a promise that each synchronized regular expression is guaranteed to generate a regular language. We illustrate the power and applicability of Theorem 4.3.

Theorem 4.4

The following predicates on the regular sets over {0, 1} satisfy the conditions of Theorem 4.3, i.e., for each of the following predicates, testing if the predicate is true for SRE is productive.

  1.

    L is a star event, i.e., \(L = (L)^*\);

  2.

    L is a code event, i.e., there exist strings \(w_1,\ldots ,w_k \in \{0, 1\}^*\) such that \(L = \{w_1,\ldots , w_k\}^*\);

  3.

    For all \(k \ge 1\), L is a k-parsable event; and L is a locally parsable event;

  4.

    L is an ultimate definite event, reverse ultimate definite event, or generalized ultimate definite event;

  5.

    L is a comet event, reverse comet event, or generalized comet event;

  6.

    \(L = \gamma (L)\), where \(\gamma (L) = \{y \mid \) there exists x in L such that \(\mid y \mid = \mid x \mid \}\);

  7.

    L is prefix closed, i.e., \(L=\{x \mid \) there exists y in \(\{0, 1\}^*\) such that \(x \cdot y \in L\}\);

  8.

    L is suffix closed, i.e., \(L=\{y \mid \) there exists x in \(\{0, 1\}^*\) such that \(x \cdot y \in L\}\);

  9.

    L is infix closed, i.e., \(L=\{y \mid \) there exist x, z in \(\{0, 1\}^*\) such that \(x \cdot y \cdot z \in L\}\);

  10.

    L is co-finite;

  11.

    For all \(k \ge 1\), L is a k-definite event, k-reverse definite event, or k-generalized definite event;

  12.

    L is definite, reverse definite, or generalized definite event;

  13.

    For all \(k \ge 1\), L is a k-testable event;

  14.

    For all \(k \ge 1\), L is k-testable in the strict sense;

  15.

    L is locally testable in the strict sense;

  16.

    L is locally testable;

  17.

    L is a star-free, non-counting, group-free, permutation-free, or LTO event;

  18.

    For all \(k > 2\), L is a CMk event;

  19.

    L is accepted by some strongly connected deterministic finite automaton;

  20.

    L is accepted by some permutation automaton;

  21.

    L is a pure group event;

  22.

    \(L = L^{rev}\); and

  23.

    L is dot-free, i.e., L is denoted by some \((\cup , \cdot , *, -)\) regular expression over {0, 1} with no occurrence of “\(\cdot \)”;

\(\square \)

Proof

The definitions of the classes of regular sets in 2, 3, and 11 through 18 can be found in [22]. The definition for 4 can be found in [23], 5 in [24], 19 in [11], 20 in [29], and 21 in [21]. The proof for each of the above predicates consists of two parts. The first part consists of observing that the predicate is true for \(\{0, 1\}^*\). The second part consists of showing that \(\Pi _{left}\) or \(\Pi _{right}\) is a proper subset of the regular sets, which can be found in [18]. \(\square \)

Since Theorem 4.3 only requires the predicates to be true for one regular set \(\{0, 1\}^*\), we can use its proof to study the complexity/undecidability of predicates on many classes of languages other than the regular sets. Two interesting examples are the following corollaries of the proof of Theorem 4.3. Due to the power and applicability of this proof and the dichotomization of its reduction, we can show Corollary 3 even though MPL is an anti-AFL (anti-abstract family of languages; see [10] for the definition) and k-pattern languages may not be closed under left or right derivatives [19]. Moreover, any class of languages \(\Gamma \subseteq {\mathcal {L}}(\mathbf {CFG(\{0,1\})})\) that contains \(\{0, 1\}^*\) satisfies Corollary 4, regardless of the closure properties of \(\Gamma \).

Corollary 3

\(\{<d> \mid d \in \mathbf {SRE(\{0,1\})}\), \({\mathcal {L}}(d)\) is a k-pattern language for any \(k \ge 1\}\) is productive, and \(\{<d> \mid d \in \mathbf {SRE(\{0,1\})}\), \({\mathcal {L}}(d)\) is a multi-pattern language\(\}\) is productive. \(\square \)

Proof

Let \(L_f =\{0^n1^n \mid n \ge 0\}\). We know that \(L_f\) is an SRE language by Example 2.1. We also know that \(L_f\) is not a multi-pattern language and that MPL is closed under left and right derivatives [19]. Hence, with the same construction as in the proof of Theorem 4.3, if \({\mathcal {L}}(e_1)=\{0,1\}^*\), then \({\mathcal {L}}(e_2)=\{0,1\}^*\), which is a 1-pattern language. Otherwise, there exists a string \(w \in \{0,1\}^+\) such that \(w {\setminus } {\mathcal {L}}(e_2)=L_f\). Since \(L_f\) is not a multi-pattern language and MPL is closed under left derivatives, \({\mathcal {L}}(e_2)\) is not a multi-pattern language. Hence, \({\mathcal {L}}(e_2)\) is not a k-pattern language for any \(k \ge 1\). \(\square \)

Corollary 4

For any fixed set \(\Gamma \) where \(\Gamma \subseteq {\mathcal {L}}(\mathbf {CFG(\{0,1\})})\) and \(\{0,1\}^* \in \Gamma \),

\(\{<d> \mid d \in \mathbf {SRE(\{0,1\})}\), \({\mathcal {L}}(d) \in \Gamma \}\) is productive.

Thus, in particular,

\(\{<d> \mid d \in \mathbf {SRE(\{0,1\})}\), \({\mathcal {L}}(d)\) is context-free \(\}\), and

\(\{<d> \mid d \in \mathbf {SRE(\{0,1\})}\), \({\mathcal {L}}(d)\) is regular \(\}\) are productive. \(\square \)

Proof

Let \(L_f =\{ww \mid w \in \{0,1\}^*\}\). The SRE \((0+1)^*\%X\cdot X\) (X is a variable) specifies \(L_f\). So with the same construction in the proof of Theorem 4.3, if \({\mathcal {L}}(e_1)=\{0,1\}^*\), then \({\mathcal {L}}(e_2)=\{0,1\}^*\). Hence, \({\mathcal {L}}(e_2) \in \Gamma \). Otherwise, there exists a string \(w \in \{0,1\}^+\) such that \(w \setminus {\mathcal {L}}(e_2)=L_f\). Since \(L_f\) is not context-free and context-free languages are closed under left derivatives, \({\mathcal {L}}(e_2)\) is not context-free. Hence, \({\mathcal {L}}(e_2) \not \in \Gamma \). \(\square \)

So far, all language class comparison problems we have studied only need the predicate to be true for one regular set \(\{0,1\}^*\). The following theorem illustrates how we can investigate the complexity/undecidability of predicates that are not true for any regular set or any context-free language.

Theorem 4.5

\(\overline{{{\textbf {K}}}}\le _m \{<d> \mid d \in \mathbf {SRE(\{0,1\})}\), \({\mathcal {L}}(d)\) is not regular but in \({\mathcal {L}}({\textbf {MP(\{0,1\})}}) \}\), and \(\overline{{{\textbf {K}}}}\le _m \{<d> \mid d \in \mathbf {SRE(\{0,1\})}\), \({\mathcal {L}}(d)\) is not context-free but in \({\mathcal {L}}({\textbf {MP(\{0,1\})}}) \}\). \(\square \)

Proof

Let \(L_t =\{1\}\cdot \{ww \mid w \in \{0,1\}^*\} \cup \{0\} \cdot \{0,1\}^*\) and \(L_f =\{0\}^* \cdot \{1\}^*\). Consider the multi-pattern \(\pi =\{1xx, 0x\}\) where x is a variable. Then, \({\mathcal {L}}(\pi ) = L_t\). Hence, \(L_t\) is a multi-pattern language. It is easy to see that \(L_t\) is not a context-free language. \(\forall e_1 \in \mathbf {SRE(\{0,1\})}\), we can effectively construct a synchronized regular expression \(e_2\) such that

$$\begin{aligned} {\mathcal {L}}(e_2)= & {} \{0\} \cdot h({\mathcal {L}}(e_1))\cdot \{11\}\cdot \{0,1\}^*\\{} & {} \qquad \qquad \cup \\{} & {} \{0\} \cdot \{00,01\}^*\cdot \{11\}\cdot L_f\\{} & {} \qquad \qquad \cup \\{} & {} L_t \cap \overline{\{0\} \cdot \{00,01\}^*\cdot \{11\}\cdot \{0,1\}^*} \end{aligned}$$

where \(h: \{0,1\}^* \mapsto \{0,1\}^*\) is the homomorphism defined by \(h(0) = 00\) and \(h(1) = 01\). Extended regular expressions (EXREGs) were introduced by Câmpeanu et al. [4] and are closed under intersection with regular sets [3]. The language \(\{ww \mid w \in \{0,1\}^*\}\) can be specified by the extended regular expression \(((0+1)^*) \setminus 1\) (\(\setminus 1\) is a backreference used to match the same content as a previously matched subexpression), so \(L_t\) can be specified by an EXREG, and hence so can \(L_t \cap \overline{\{0\} \cdot \{00,01\}^*\cdot \{11\}\cdot \{0,1\}^*}\). It is easy to see that SRE contain EXREGs effectively since variable bindings can function as backreferences. Hence, \(L_t \cap \overline{\{0\} \cdot \{00,01\}^*\cdot \{11\}\cdot \{0,1\}^*}\) can be specified by a synchronized regular expression. Since \(L_t\), \(L_f\) and \(\overline{\{0\} \cdot \{00,01\}^*\cdot \{11\}\cdot \{0,1\}^*}\) are fixed languages, the construction of \(e_2\) can be done in time polynomial in \(\vert e_1 \vert \). If \({\mathcal {L}}(e_1) =\{0,1\}^*\), then since \(\{0\} \cdot \{0,1\}^* \subseteq L_t\), \({\mathcal {L}}(e_2) = L_t\). Hence, \({\mathcal {L}}(e_2)\) is a multi-pattern language but not context-free. Otherwise, \(\exists w \in \{0,1\}^*\) such that \(0\,h(w)11 {\setminus } {\mathcal {L}}(e_2) = L_f\). MPL is closed under left and right derivatives [19], so \({\mathcal {L}}(e_2)\) is not a multi-pattern language. \(\square \)
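The claim \({\mathcal {L}}(\pi ) = L_t\) for the multi-pattern \(\pi =\{1xx, 0x\}\) is easy to check mechanically (assuming erasing substitutions for the variable x, which the equality requires). A Python membership sketch:

```python
def in_L_pi(t: str) -> bool:
    """Membership in L(pi) for pi = {1xx, 0x} over {0,1}, with the variable
    x ranging over {0,1}* (erasing allowed):
      - 0x matches any string starting with 0;
      - 1xx matches 1·w·w, i.e. '1' followed by a square."""
    if t.startswith('0'):
        return True
    if t.startswith('1') and len(t) % 2 == 1:   # 1·w·w has odd length
        body = t[1:]
        half = len(body) // 2
        return body[:half] == body[half:]
    return False
```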

5 Descriptional complexity of SRE

In this section, we study the descriptional complexity of SRE using the special properties of the predicate “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)”. Regular languages are the most commonly used formal languages. In [7], Freydenberger shows that there is no recursive trade-off between SRE and regular expressions. But it may be practically more relevant to ask about the trade-off between SRE and a class of language descriptors accepting a particular subset of the regular languages. The special properties of the predicate “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)” enable us to study such trade-offs, since the languages \(\{0, 1\}^*\) and \(\{0, 1\}^* -\{w\}\) are both co-finite and are among the simplest regular languages. In Theorem 5.1, we show that there is no recursive trade-off between SRE and DFA. Then, Theorem 5.2 generalizes the proof of Theorem 5.1 and gives sufficient conditions for establishing non-recursive trade-offs between SRE and many classes of language descriptors. To illustrate the power and applicability of Theorem 5.2, we show that any class of language descriptors accepting languages satisfying any predicate listed in Theorem 4.4 satisfies the conditions of Theorem 5.2.

To show that our results are even more widely applicable, we tune Theorem 5.2 with slight changes to study the trade-off between SRE and multi-patterns. Here, we redefine INVALCM(w) developed in Definition 5 to tune the conditions in Theorem 5.2 so they can fit the properties of multi-patterns (Lemmas 5.2 and 5.3). Intuitively, Definition 5 defines INVALCM(w) as either \(\{0, 1\}^*\) or \(\{0, 1\}^* -\{w\}\). The redefined INVALCM(w) in Definition 8 defines INVALCM(w) as either \(\{0, 1\}^*\) or \(\{0, 1\}^* -\{w\} \cdot \{0, 1\}^*\). Lemma 5.2 shows that \(\{0, 1\}^* -\{w\} \cdot \{0, 1\}^*\) is a multi-pattern language. Lemma 5.3 shows that for any multi-pattern \(\pi \) generating \(\{0, 1\}^* -\{w\} \cdot \{0, 1\}^*\), \(\vert \pi \vert \ge \vert w \vert -1\). With the tuned conditions, we establish Theorem 5.3 to show there is no recursive trade-off between SRE and multi-patterns. This is another example to show Proposition 3 is easily applicable.

The following lemma for DFA is well-known and needed for proving Theorem 5.1.

Lemma 5.1

\(\forall w \in \{0,1\}^*\), let \(L_w =\{0,1\}^*-\{w\}\). For any DFA M that accepts \(L_w\), \(\vert M \vert \ge \vert w \vert \). \(\square \)

Theorem 5.1

There exists a subset \(D'\) of SRE({0,1}) such that

  1.

    \(D' \in \) P;

  2.

    \(\forall d \in D'\), \({\mathcal {L}}(d)\) is regular;

  3.

    There exists no recursive function \(f:{\mathbb {N}} \mapsto {\mathbb {N}}\) such that \(\forall d \in D'\), for any minimal DFA M accepting \({\mathcal {L}}(d)\), \(\vert M \vert \le f(\vert d \vert )\); and

  4.

    There exists a fixed constant \(C>0\) such that

    \(\{<d> \mid d \in D'\), \(\exists \) a DFA M such that \({\mathcal {L}}(M) ={\mathcal {L}}(d)\) and \(\vert M \vert <C\}\) is productive, hence, not recursively enumerable.

\(\square \)

Proof of 1, 2, and 3: Let \(D'\) be the same \(D'\) defined in Theorem 3.1. We know that every language in \({\mathcal {L}}(D')\) is either \(\{0, 1\}^*\), or \(\{0, 1\}^* -\{t\}\) where \(t \in \{0, 1\}^*\) is the coded valid computation of a Turing machine N on an input w. Assume a recursive function f as stated in 3 exists. From Lemma 5.1, we know that \(f(\vert d \vert ) \ge \vert M \vert \ge \vert t \vert \). If the Turing machine N halts on w, let x denote the number of steps N takes to halt. Then, \(\vert t \vert \ge x\cdot \vert w \vert \). Hence, \(f(\vert d \vert ) >x\). Hence, the halting problem is recursive, which is a contradiction.

Proof of 4: Let \(C=2\). For any \(d \in D'\), if \({\mathcal {L}}(d) =\{0,1\}^*\), then there exists a DFA M with a single state such that \({\mathcal {L}}(M) ={\mathcal {L}}(d)\), and hence \(\vert M \vert < C\). Otherwise, \({\mathcal {L}}(d)=\{0,1\}^*-\{t\}\) where \(t \in \{0,1\}^+\); then from Lemma 5.1, for any DFA M specifying \({\mathcal {L}}(d)\), \(\vert M \vert \ge \vert t \vert \). Therefore, it is clear that \(\vert M \vert \ge C\). \(\square \)
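Lemma 5.1's lower bound is tight up to an additive constant: a DFA with \(\vert w \vert +2\) states suffices for \(\{0,1\}^*-\{w\}\). A Python sketch of this DFA (our own illustration, not part of the paper):

```python
def dfa_for_complement_of(w: str):
    """DFA for {0,1}* - {w} with |w| + 2 states: state i means the input so
    far is exactly the prefix w[:i]; 'sink' means the input already differs
    from w, so every extension is accepted."""
    n = len(w)
    states = list(range(n + 1)) + ['sink']
    def delta(q, c):
        if q == 'sink' or q == n:          # w already read or left behind
            return 'sink'
        return q + 1 if c == w[q] else 'sink'
    accepting = {q for q in states if q != n}   # only exactly-w rejects
    return states, delta, 0, accepting

def accepts(dfa, t: str) -> bool:
    """Run the DFA on input t and report acceptance."""
    states, delta, q, accepting = dfa
    for c in t:
        q = delta(q, c)
    return q in accepting
```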

We establish the following theorem to generalize Theorem 5.1. Many non-recursive trade-offs between SRE and other classes of language descriptors can be proved using it.

Theorem 5.2

Let \({{\mathcal {D}}}\) be any class of language descriptors over alphabet \(\{0,1\}\) such that

  1.

    For any \(w\in \{0,1\}^*\), \(L_w =\{0,1\}^*-\{w\} \in {\mathcal {L}}({{\mathcal {D}}})\); and

  2.

    There exists a strictly increasing recursive function \(f:{\mathbb {N}} \mapsto {\mathbb {N}}\) such that for any \(d \in {{\mathcal {D}}}\) specifying \(L_w\), \(f(\vert d \vert ) > \vert w \vert \).

Then, there is no recursive trade-off between SRE and \({{\mathcal {D}}}\). \(\square \)

Proof

Assume, for contradiction, that there exists a recursive function \(g:{\mathbb {N}} \mapsto {\mathbb {N}}\) such that for any synchronized regular expression e specifying the coded INVALCM(w) (note that INVALCM(w) can be efficiently coded over the alphabet \(\{0, 1\}\) with \(\vert INVALCM(w)^c \vert \le 1\)) and any \(d \in {{\mathcal {D}}}\) specifying \({\mathcal {L}}(e)\), we have \(g(\vert e \vert ) >\vert d \vert \). Since f is strictly increasing, \(f(g(\vert e \vert ))> f(\vert d \vert ) > \vert t \vert \), where \(t \in VALCM(w)\); clearly, \(f \circ g\) is again a recursive function. Moreover, \(\vert t \vert >x\), where x is the number of steps the Turing machine N takes to halt on w. Hence, \(f(g(\vert e \vert )) >x\), so the halting problem would be recursive, which is a contradiction. \(\square \)
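The trade-off proofs above all bottom out in the same observation: a computable bound on a machine's halting time makes halting decidable by step-bounded simulation. The following Python sketch shows that final step; the toy one-tape machine, the `halts_within` helper, and the dictionary-based tape encoding are illustrative assumptions of ours, not the paper's coding.

```python
def halts_within(delta, q0, accept, tape, bound):
    """Simulate a one-tape TM for at most `bound` steps.

    Returns True iff the machine has halted (accepted or got stuck)
    within `bound` steps -- the decision procedure the proofs rely on.
    """
    tape = dict(enumerate(tape))          # sparse tape, blanks implicit
    q, pos = q0, 0
    for _ in range(bound):
        if q in accept:
            return True
        sym = tape.get(pos, "B")          # "B" denotes the blank symbol
        if (q, sym) not in delta:         # no transition: the machine halts
            return True
        q, write, move = delta[(q, sym)]
        tape[pos] = write
        pos += 1 if move == "R" else -1
    # bound exhausted: halted only if already stuck or accepting
    return q in accept or (q, tape.get(pos, "B")) not in delta

# Toy machine: scan right over 0s and halt at the first blank.
delta = {("s", "0"): ("s", "0", "R")}
assert halts_within(delta, "s", set(), "000", bound=10)
# A machine that moves right forever on blanks never halts.
looper = {("s", "B"): ("s", "B", "R")}
assert not halts_within(looper, "s", set(), "", bound=10)
```

Given the recursive bound \(f(g(\vert e \vert ))\) from the proof, calling such a bounded simulator with `bound` set to that value would decide halting, which is the contradiction.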

The following corollary illustrates the power and applicability of Theorem 5.2.

Corollary 5

Any class of language descriptors \({{\mathcal {D}}}\) whose language class \({\mathcal {L}}({{\mathcal {D}}})\) satisfies any predicate listed in Theorem 4.4 also satisfies the conditions of Theorem 5.2; i.e., there is no recursive trade-off between SRE and \({{\mathcal {D}}}\).

It is also interesting to investigate the trade-off between SRE and multi-patterns. To illustrate that our results are tunable and widely applicable, we slightly modify the definition of INVALCM(w) as follows. Intuitively, the redefined INVALCM(w) is either \(\{0, 1\}^*\) or \(\{0, 1\}^* -\{w\} \cdot \{0, 1\}^*\).

Definition 8

Recall the deterministic Turing machine \(M=(Q, \Sigma , T,\delta , q_0,B,F)\) mentioned in Sect. 3. For all \(w \in \Sigma ^+\), writing \(w=w_1w_2w_3\ldots w_k\) where \(w_j \in \Sigma \) \((1 \le j\le k)\), the set of valid computations of M on w, denoted by \(VALCM'(w)\), is the set of strings of the form \(\#id_0\#id_1\#id_2\cdots \#id_n\# \cdot t\) such that

  1.

    \(t \in \Delta _M^*\)

  2.

    each \(id_i\) \((1 \le i \le n)\) is an ID of M

  3.

    \(id_0 = (q_0, w_1)w_2w_3\ldots w_k\) is the initial ID of M on w

  4.

    \(id_n\) is a final ID

  5.

    \(id_i \vdash _M id_{i+1}\) for \(0 \le i < n\)

The set of invalid computations of M on w denoted by \(INVALCM'(w)\) is the complement of \(VALCM'(w)\) with respect to \(\Delta _M^*\). \(\square \)
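To make the conditions of Definition 8 concrete, the following Python sketch checks the ID-sequence part of a candidate computation for a toy Turing machine. The machine, the list-based ID encoding (a tape cell holding a `(state, symbol)` pair marks the head), and the simplified successor relation are illustrative assumptions of ours, not the coding used in the paper.

```python
def step(ident, delta, blank="B"):
    """Compute the successor ID under |-_M, or None if the machine has halted.

    An ID is a list of tape symbols in which exactly one entry is a
    (state, symbol) pair marking the head position.
    """
    i = next(k for k, s in enumerate(ident) if isinstance(s, tuple))
    q, a = ident[i]
    if (q, a) not in delta:
        return None                       # no transition: halted
    p, b, move = delta[(q, a)]
    new = list(ident)
    new[i] = b                            # write the new symbol
    if move == "R":
        if i + 1 == len(new):
            new.append(blank)             # extend the tape with a blank
        new[i + 1] = (p, new[i + 1])
    else:
        new[i - 1] = (p, new[i - 1])
    return new

def is_valid_computation(ids, delta, q0, w, finals):
    """Check conditions 3-5 of Definition 8 on a sequence of IDs."""
    init = [(q0, w[0])] + list(w[1:])                 # id_0 = (q_0,w_1)w_2...w_k
    if ids[0] != init:
        return False
    if not any(isinstance(s, tuple) and s[0] in finals for s in ids[-1]):
        return False                                   # id_n must be final
    return all(step(ids[i], delta) == ids[i + 1] for i in range(len(ids) - 1))

# Toy TM: replace the first 0 by 1, move right, and accept.
delta = {("q0", "0"): ("qf", "1", "R")}
ids = [[("q0", "0"), "0"], ["1", ("qf", "0")]]
assert is_valid_computation(ids, delta, "q0", "00", {"qf"})
assert not is_valid_computation([ids[0], ids[0]], delta, "q0", "00", {"qf"})
```

A string in \(VALCM'(w)\) additionally carries the arbitrary tail \(t \in \Delta _M^*\) after the final \(\#\); the checker above covers only the ID-sequence prefix.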

With this refined definition and the following two lemmas, we can study the trade-off between SRE and multi-patterns using a method similar to Theorem 5.2.

Lemma 5.2

\(\forall w \in \{0,1\}^+\), the language \(L_w =\{0,1\}^*-\{w\}\cdot \{0,1\}^*\) is a multi-pattern language. \(\square \)

Proof

Let \(w = w_0w_1w_2\cdots w_k\) where \(w_i \in \{0, 1\}\) \((0 \le i \le k)\). Let \({\bar{0}} =1\) and \({\bar{1}}=0\). Consider the multi-pattern \(\pi =\{\lambda \), \(w_0\), \(w_0w_1\), ..., \(w_0w_1\ldots w_{k-1}\), \(\bar{w_0}x\), \( w_0\bar{w_1}x\), \(w_0w_1\bar{w_2}x\), ..., \(w_0w_1w_2\ldots w_{k-1}\bar{w_k}x\}\), where x is a variable. The constant patterns generate exactly the proper prefixes of w, and the patterns ending in x generate exactly the strings that first differ from w at some position; hence \({\mathcal {L}}(\pi ) = L_w\). \(\square \)
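The construction of Lemma 5.2 is easy to check mechanically. The Python sketch below (the name `pattern_language` and the enumeration-based check are our own illustration, not the paper's method) enumerates the strings generated by \(\pi \) up to a length bound and compares them with \(L_w =\{0,1\}^*-\{w\}\cdot \{0,1\}^*\), i.e. the strings that do not start with w.

```python
from itertools import product

def flip(c):
    """The bar operation: 0 -> 1, 1 -> 0."""
    return "1" if c == "0" else "0"

def pattern_language(w, max_len):
    """Strings of length <= max_len generated by the multi-pattern pi of Lemma 5.2."""
    gen = set()
    # Constant patterns: all proper prefixes of w (lambda included).
    for i in range(len(w)):
        if i <= max_len:
            gen.add(w[:i])
    # Patterns w[:i] . flip(w[i]) . x, with the variable x ranging over {0,1}*.
    for i in range(len(w)):
        stem = w[:i] + flip(w[i])
        for n in range(max_len - len(stem) + 1):
            for x in product("01", repeat=n):
                gen.add(stem + "".join(x))
    return gen

w, max_len = "011", 6
expected = {"".join(s) for n in range(max_len + 1)
            for s in product("01", repeat=n)
            if not "".join(s).startswith(w)}
assert pattern_language(w, max_len) == expected
```

The two pattern families are disjoint from \(\{w\}\cdot \{0,1\}^*\) by construction (a stem differs from w at its last position), and together they cover every string that does not begin with w, which is the equality the proof asserts.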

Lemma 5.3

For any multi-pattern \(\pi \) that generates the language \(L_w =\{0,1\}^*-\{w\}\cdot \{0,1\}^*\) where \(w \in \{0,1\}^+\), \(\vert \pi \vert \ge \vert w \vert -1\). \(\square \)

Proof

Assume \(\vert \pi \vert < \vert w \vert -1\). Let \(w_L\) be the longest proper prefix of w. Then \(\vert w_L \vert =\vert w \vert -1\) and \(w_L \in L_w\). Hence, there is a pattern \(\alpha \in \pi \) such that \(w_L \in {\mathcal {L}}(\alpha )\). By the assumption, \(\vert \alpha \vert <\vert w_L \vert \), since otherwise \(\vert \pi \vert \ge \vert \alpha \vert \ge \vert w \vert -1\), a contradiction. Hence, at least one variable occurs in \(\alpha \).

Let x be the leftmost variable occurring in \(\alpha \) and let V be the set of variables of \(\pi \). Write \(\alpha = P_1xP_2\) where \(P_1 \in \{0,1\}^*\) and \(P_2 \in (\{0,1\} \cup V)^*\). Since \(w_L \in {\mathcal {L}}(\alpha )\) and \(\vert P_1 \vert \le \vert \alpha \vert < \vert w_L \vert \), \(P_1\) is a proper prefix of w. Hence, \(\exists t \in \{0,1\}^*\) such that \(P_1\cdot t =w\).

Substituting t for x then yields a string in \(\{w\}\cdot \{0,1\}^*\) that matches \(\alpha \), which is a contradiction. \(\square \)

Theorem 5.3

There exists a subset \(S'\) of SRE({0,1}) such that

  1.

    \(S' \in \) P;

  2.

    \(\forall d \in S'\), \({\mathcal {L}}(d)\) is a multi-pattern language;

  3.

    There exists no recursive function \(f:{\mathbb {N}} \mapsto {\mathbb {N}}\) such that \(\forall d \in S'\), for any minimal multi-pattern \(\pi \) specifying \({\mathcal {L}}(d)\), \(\vert \pi \vert \le f(\vert d \vert )\); and

  4.

    There exists a fixed constant \(C>0\) such that

    \(\{\langle d \rangle \mid d \in S'\), \(\exists \) a multi-pattern \(\pi \) such that \({\mathcal {L}}(\pi ) ={\mathcal {L}}(d)\) and \(\vert \pi \vert <C\}\) is productive, hence, not recursively enumerable.

\(\square \)

Proof of 1, 2, and 3: Corresponding to Definition 8, \(L_1\) through \(L_5\) in Proposition 3 need to be changed slightly. Let \(\Delta _1 = \Delta _M-(F \times T) -\{\#\}\) and \(\Delta _2 = \Delta _M-(F \times T)\). Let

$$\begin{aligned} L_1'&= \Delta ^*_M- \{\#\}\cdot (T^*\cdot (Q \times T)\cdot T^*\cdot \{\#\})^+\cdot \Delta _{M}^*\\ L_2'&= \Delta ^*_M-\Delta ^*_M\cdot \{\#\}\cdot T^*\cdot (F \times T)\cdot T^* \cdot \{\#\}\cdot \Delta _M^*\\ L_3'&= \{\lambda \} \cup ((\Delta _M-\{\#\}) \cup \{\#\}\cdot ((\Delta _M-\{(q_0,w_1)\}) \cup \{(q_0,w_1)\}\\ &\quad \cdot \, ((\Delta _M-\{w_2\}) \cup \cdots \cup \{w_{k-1}\} \cdot ((\Delta _M -\{w_k\}) \cup \{w_k\} \cdot \Delta '_M) \cdots )))\cdot \Delta ^*_M\\ L_4'&= \Delta ^*_2 \cdot \Delta '_1 \cdot \{x \# y \mid x\in \Delta _1^*\hbox { and }y \in \Delta '^*_M, \vert x \vert = \vert y \vert \}\cdot \{\#\}\cdot \Delta ^*_M\\ &\quad \cup \ \Delta ^*_2 \cdot \{\#\} \cdot \{x \# y \mid x \in \Delta _1^*\hbox { and } y \in \Delta '^*_M, \vert x \vert = \vert y \vert \}\cdot \Delta '_M\cdot \Delta '_M \cdot \Delta ^*_M\\ L_5'&= L_{5.1}' \cup L_{5.2}' \cup L_{5.3}' \cup L_{5.4}'\hbox { where }\\ L_{5.1}'&= \displaystyle \bigcup _{\begin{array}{c} a,b,c\in \Delta '_M \\ a\not \in (Q \times T)\hbox { or }a \not \vdash _M bc \end{array}} \Delta ^*_2\cdot \{\#\}\cdot \{ua \# vbc \mid u\in \Delta _1^*, v \in \Delta '^*_M\hbox { and }\vert u \vert = \vert v \vert \}\cdot \{\#\}\cdot \Delta ^*_M\\ L_{5.2}'&= \displaystyle \bigcup _{\begin{array}{c} a,b,c \in \Delta '_M \\ ab \not \vdash _M c \end{array}} \Delta ^*_2 \cdot \{ \# ab\}\cdot \Delta ^*_1 \cdot \{\#c\}\cdot \Delta ^*_M\\ L_{5.3}'&= \displaystyle \bigcup _{\begin{array}{c} a,b,c,d,e\in \Delta '_M \\ ab \not \vdash _M c\hbox { or }b \vdash _M de \end{array}} \Delta ^*_2\cdot \{\#\}\cdot \{uab \# vc \mid u \in \Delta _1^*, v \in \Delta '^*_M\hbox { and }\vert u \vert =\vert v \vert -1 \}\cdot \{\#\}\cdot \Delta ^*_M\\ L_{5.4}'&= \displaystyle \bigcup _{\begin{array}{c} a,b,c,d\in \Delta '_M \\ abc \not \vdash _M d \end{array}} \Delta ^*_2\cdot \{\#uabcw \# vd \mid u,w \in \Delta _1^*, v \in \Delta '^*_M\hbox { and }\vert u \vert =\vert v \vert -1 \}\cdot \Delta ^*_M \end{aligned}$$

Since all states in F are final, for any string \(s=\#id_0\#id_1\#id_2\cdots \#id_n\# \cdot t \in VALCM'(w)\), \(id_n\) is the first ID that contains a letter in \(F \times T\). We can use this property to determine which part of s belongs to \(\#id_0\#id_1\#id_2\cdots \#id_n\#\) and which part belongs to t. Thus, it is not hard to see that \(INVALCM'(w) = L_1' \cup L_2' \cup L_3' \cup L_4' \cup L_5'\) and that \(INVALCM'(w)\) can be expressed by a synchronized regular expression. Recall the set \(D'\) defined in Theorem 3.1. We can slightly modify \(D'\) into a set \(S'\) such that \({\mathcal {L}}(S')\) is the set of coded \(INVALCM'(w)\) over \(\{0, 1\}\); the proof is very similar to that of Theorem 3.1. From Lemma 5.2, every language in \({\mathcal {L}}(S')\) is a multi-pattern language. Assume a recursive function f as stated in 3 exists. From Lemma 5.3, \(f(\vert d \vert ) \ge \vert \pi \vert \ge \vert t \vert -1\), where \(t \in VALCM(w)\). If the Turing machine M halts on w, let x denote the number of steps M takes; then \(\vert t \vert \ge x\cdot \vert w \vert \), so \(x \le f(\vert d \vert )+1\). Therefore, the halting problem would be recursive, which is a contradiction.

Proof of 4: Let \(C=2\). For any \(d \in S'\), if \({\mathcal {L}}(d) =\{0,1\}^*\), then the multi-pattern \(\{x\}\), where x is a variable, satisfies \({\mathcal {L}}(\{x\}) ={\mathcal {L}}(d)\) and has size \(1 < C\). Otherwise, \({\mathcal {L}}(d)=\{0,1\}^*-\{u\}\cdot \{0,1\}^*\) where \(u \in \{0,1\}^+\) is a coded valid computation; then, from Lemma 5.3, any multi-pattern \(\pi \) specifying \({\mathcal {L}}(d)\) satisfies \(\vert \pi \vert \ge \vert u \vert -1\). Since \(\vert u \vert \ge 3\), it is clear that \(\vert \pi \vert \ge 2 = C\). \(\square \)

6 Conclusion

In this paper, we employed productiveness, a stronger form of non-recursive enumerability. If a language predicate is productive, then it is not recursively enumerable; moreover, there is an effective procedure which, given as input a program enumerating an effective axiomatic system that proves only true instances of the predicate, produces a statement that is true for the predicate but not provable in that axiomatic system. We revised the definition of the sets of invalid computations of Turing machines and used this definition to show that the predicate “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)” is productive for SRE. This result enabled us to establish the productiveness of many problems for SRE, especially promise problems. Using the special properties of the predicate “\(= \{0,1\}^* \mid _{\vert L^c \vert \le 1}\)”, we also proved non-recursive trade-offs between SRE and many classes of language descriptors.