On the Complexity of the Smallest Grammar Problem over Fixed Alphabets

Casel, Katrin; Fernau, Henning; Gaspers, Serge; Gras, Benjamin; Schmid, Markus L.

doi:10.1007/s00224-020-10013-w

Open access
Published: 13 November 2020

Volume 65, pages 344–409, (2021)

Theory of Computing Systems

Katrin Casel¹,
Henning Fernau²,
Serge Gaspers³,
Benjamin Gras⁴ &
…
Markus L. Schmid ORCID: orcid.org/0000-0001-5137-1504⁵

Abstract

In the smallest grammar problem, we are given a word w and we want to compute a preferably small context-free grammar G for the singleton language {w} (where the size of a grammar is the sum of the sizes of its rules, and the size of a rule is measured by the length of its right side). It is known that, for unbounded alphabets, the decision variant of this problem is NP-hard and the optimisation variant does not allow a polynomial-time approximation scheme, unless P = NP. We settle the long-standing open problem of whether these hardness results also hold for the more realistic case of a constant-size alphabet. More precisely, it is shown that the smallest grammar problem remains NP-complete (and its optimisation version is APX-hard), even if the alphabet is fixed and has size at least 17. The corresponding reduction is robust in the sense that it also works for an alternative size-measure of grammars that is commonly used in the literature (i. e., a size measure also taking the number of rules into account), and it also allows us to conclude that even computing the number of rules required by a smallest grammar is a hard problem. On the other hand, if the number of nonterminals (or, equivalently, the number of rules) is bounded by a constant, then the smallest grammar problem can be solved in polynomial time, which is shown by encoding it as a problem on graphs with interval structure. However, treating the number of rules as a parameter (in terms of parameterised complexity) yields W[1]-hardness. Furthermore, we present an $\mathcal {O}(3^{|w|})$ exact exponential-time algorithm, based on dynamic programming. These three main questions are also investigated for 1-level grammars, i. e., grammars for which only the start rule contains nonterminals on the right side; thus, investigating the impact of the “hierarchical depth” of grammars on the complexity of the smallest grammar problem. In this regard, we obtain for 1-level grammars similar, but slightly stronger results.

1 Introduction

Context-free grammars are among the most classical concepts in theoretical computer science. Their wide range of applications, both of theoretical and practical nature, is well-known and usually forms an integral part of academic undergraduate courses in computer science. In this paper, we are concerned with grammars G that describe singleton languages {w} (or, by slightly abusing notation, grammars describing single words).^{Footnote 1}

1.1 Grammars as Inference Tools and Compressors

Although, from a formal languages point of view, describing a single word by a context-free grammar seems excessive, there are at least two evident motivations:

Compression Perspective:^{Footnote 2} The grammar G is a compressed representation of the word w.
Inference Perspective: The grammar G identifies the hierarchical structure of the word w.

The inference perspective can be traced back to the work of Nevill-Manning and Witten [1, 2],^{Footnote 3} in which the authors consider algorithmic possibilities of extracting (hierarchical) structure from sequential data, such as texts (in a natural or formal language), music or DNA, by constructing a grammar for a given sequence. The hypothesis that small grammars are to be preferred can be considered as an application of Occam’s razor (note that the size of a grammar is the sum of the sizes of its rules, where the size of a rule is measured by the length of its right side). In a more general sense, Nevill-Manning and Witten’s approach embarks on the quest of inferring the intrinsic information content of a given sequence, which is a central problem in learning theory and algorithmic information theory (especially Kolmogorov complexity, as mentioned below). In Nevill-Manning’s PhD-thesis [2], a multitude of connections between the compression perspective of computing grammars for single words and other core topics of mathematics and theoretical computer science are discussed (e. g., the minimum description length principle in learning theory, information theory, data compression). The inference perspective of computing grammars for single words has been applied in two more PhD-theses, namely by de Marcken [3] in order to investigate whether analysing the structure of small grammars for large English texts could help understanding the structure of the language itself, and by Gallé [4] in order to infer hierarchical structures in DNA. Moreover, Lanctot et al. [5] contribute to the work on estimating the entropy of DNA-sequences (see the references in [5]), by using an algorithm first proposed by Kieffer and Yang [6] to compute grammars for DNA-sequences.

While in the above mentioned work, grammars are mainly used as an inference tool, the obvious connections to data compression are often highlighted as well (e. g., in [2]). The work of Kieffer et al. [6,7,8] directly approaches the concept of representing words by grammars from a traditional data compression perspective, i. e., we want to compute a small grammar representing a large given word w (in the following, we denote the general concept of compressing a single word by a context-free grammar as grammar-based compression). Besides the above mentioned papers by Nevill-Manning and Witten, the work by Kieffer et al. is usually stated as the second origin of using grammars for single words, but a closer look into the older literature reveals that the external pointer macro scheme (without overlapping and with pointer size 1) defined by Storer and Szymanski [9, 10] is also equivalent to grammar-based compression.

Another motivation is that grammar-based compression, like any lossless data compression scheme, provides a computable upper bound of the Kolmogorov complexity (see [11]). Since this central measure in algorithmic information theory is generally incomputable, such computable approximations are important and, in this regard, grammars are of relevance, since, in comparison to other practically applied compression schemes, they achieve high compression rates and therefore yield a better approximation of the Kolmogorov complexity (in this regard, note that many practically relevant compression schemes, e. g., some of the ones mentioned in Section 1.3, allow fast compression and decompression, but cannot achieve exponential compression rates).

1.2 Algorithmics on Compressed Strings

The original motivations outlined so far are still relevant, but the actual reasons why grammar-based compression has experienced a renaissance and thrives today as an independent and important field of research are the following. While in the early days of computer science, the most important requirements for compression schemes were fast (i. e., linear or near linear time) compression and decompression, nowadays the question of whether they are suitable for solving problems directly on the compressed data without prior decompression forms a vibrant research area.^{Footnote 4} This area is usually subsumed under the term algorithmics on compressed strings, and grammar-based compression is particularly well suited for this purpose.

The success of grammars with respect to algorithmics on compressed strings is due to the fact that they cover many compression schemes from practice (most notably, the family of Lempel-Ziv encodings) and that they are mathematically easy to handle (see Lohrey [15] for a survey on the role of grammar-based compression for algorithmics on compressed strings). Many basic problems on strings, e. g., comparison, pattern matching, membership in a regular language, retrieving subwords, etc. can all be solved in polynomial time directly on the grammars [15]. In addition, grammar-based compression has been successfully applied in combinatorial group theory (see the textbook [16] by Lohrey) and to prove problems in computational topology to be polynomial-time solvable [15]. Grammars as compression schemes have also been extended to more complicated objects, e. g., trees (see [17,18,19,20,21], and [21, 22] for applications in term unification) and two-dimensional words (see [23]). It is also worth pointing out the successful applications of compression-techniques for solving word equations (see, e. g., [24, 25]).

A rather recent result is that any context-free grammar for a single word can be transformed in linear time into an equivalent one that is balanced in the sense that the depth of its derivation tree is logarithmic in the size of the represented word (see [26]). This result has a direct impact on basic algorithmic problems on grammar-compressed data, e. g., the random access problem (i. e., accessing in the compressed string the symbol at a given position).

1.3 The Smallest Grammar Problem

For grammar-based compression, the central computational problem is that of computing a smallest (or at least small) grammar for a given word, which is called the smallest grammar problem,^{Footnote 5} and the respective literature is mainly about approximation algorithms:^{Footnote 6} LZ78 [35], LZW [36], Bisection [7], Sequitur [1, 2] and Sequential [8], Longest Match [6], Greedy [37], Re-Pair [38] (the names of algorithms in this list are according to [33, 34]). These algorithms share the benefit of being rather simple and fast, and their approximation ratios have been studied thoroughly by Charikar et al. in [33] and by Lehman in his PhD-thesis [34]; some bounds have recently been further improved by Hucke et al. [39]. Unfortunately, none of the approximation ratios are constant and the currently best achieved approximation ratio is $\mathcal {O}\left (\log \left (\frac {|{w}|}{m^{*}}\right )\right )$, where m^∗ is the size of a smallest grammar (i. e., it is still open whether an approximation algorithm with a constant approximation-ratio exists, or equivalently, whether the problem is in APX). This result is due to the algorithms by Rytter [40] and Charikar et al. [33, 34], which have been developed independently from each other and are not mentioned in the above list. On the other hand, assuming P≠NP, it has been shown in [33, 34] that an approximation ratio better than $\frac {8569}{8568} \approx 1.0001$ is not possible (thus, ruling out a polynomial-time approximation scheme (PTAS)). However, the research seems to have stagnated at this huge gap between lower and upper bound and still neither an approximation algorithm with a constant approximation ratio nor stronger inapproximability results are known.

The strong bias towards approximation algorithms is usually justified by the general NP-hardness of the smallest grammar problem, but, as explained next, this theoretical justification is seriously flawed. The NP-completeness can be shown by a reduction from vertex cover (see [33, 34]), but in the reduction, an unbounded number of symbols in the underlying alphabet is needed. This means that the hardness reduction is invalid for any realistic scenario, where we deal with a constant alphabet (even more so if the alphabet is rather small, as is the case in practical applications). Consequently, since the motivation for the approximation algorithms mentioned above is of a rather practical kind (i. e., string compression in real-world scenarios), this theoretical foundation falls apart (in particular, note that an unbounded alphabet is also necessary for the inapproximability result of [33, 34]). One reason for this situation is probably that in [41], it is claimed that the hardness for alphabets of size 3 follows from [10], but a closer look into [10] does not confirm this (we elaborate on this claim in Section 2.4). Consequently, the NP-hardness of the smallest grammar problem for fixed alphabets is essentially open (for well over 30 years, taking [9, 10] as the first reference, which investigates hardness and complexity questions).

1.4 Our Contribution

The main result of this paper is a reduction that proves the smallest grammar problem for fixed alphabets to be NP-complete, at least for alphabet sizes of 17 or larger. As explained above, this closes an important gap in the literature and therefore puts the previous work on grammar-based compression on a more solid theoretical foundation.

Moreover, it also follows that the optimisation version of the smallest grammar problem is APX-hard; thus, the impossibility of a PTAS, previously only known for unbounded alphabets, carries over to the more realistic case of bounded alphabets. By a minor modification of this reduction, we can also show that these two hardness results hold for a slightly different (but frequently used) size measure of grammars, i. e., the rule-size, which equals the size of a grammar as defined above plus the number of its rules (both these measures are formally defined in Section 2.2).

Given these negative complexity results, we move on to the question of whether smallest grammars can be efficiently computed, if certain parameters (e. g., levels of the derivation tree, number of rules) are bounded. In this regard, we show that smallest grammars can be computed in polynomial time, provided that the size of the nonterminal alphabet (i. e., number of rules) is bounded. This result, which is due to an encoding of the smallest grammar problem as a problem on graphs with interval structure, raises two follow-up questions: (1) is the problem fixed-parameter tractable with respect to the number of rules, (2) is it possible to efficiently compute, how many rules are at least necessary for a smallest grammar? Both of these questions are answered in the negative, by showing W[1]-hardness and NP-hardness, respectively.

Finally, we investigate exact exponential-time algorithms which are not yet considered in the literature. We consider this a relevant topic, since grammars are particularly suitable for solving basic problems directly on the compressed representation without decompression, which motivates scenarios, where an extensive running time is invested only once, in order to obtain an optimal compression, which is then stored and worked with. While brute-force algorithms with running time $\mathcal {O}^{*}(c^{|{w}|})$, for a constant c, can be easily found, we present a dynamic programming algorithm with running time $\mathcal {O}^{*}(3^{{|w|}})$.

The exploitation of hierarchical structure is one of the main features of grammars (making them suitable tools for structural inference, and also allowing exponential compression rates) and is reflected in the number of levels of the corresponding derivation tree. Hence, from a (parameterised) complexity point of view, it is natural to measure the impact of this “hierarchical depth” of grammars with respect to the complexity of the smallest grammar problem. To this end, we investigate the above mentioned questions also for 1-level grammars, i. e., grammars in which only the start rule contains nonterminals and, surprisingly, our results suggest that computing general grammars is, if at all, only insignificantly more difficult than computing 1-level grammars. More precisely, the smallest grammar problem for 1-level grammars is NP-hard for alphabets of size 5 (also with respect to the rule size measure), W[1]-hard if parameterised by the number of rules, it can be solved in polynomial time if the number of rules is bounded by a constant and there is an $\mathcal {O}^{*}(1.8392^{|{w}|})$ exact algorithm. Moreover, the exact exponential-time algorithm for the general case works incrementally in the sense that in the process of producing a smallest grammar, it also produces a smallest 1-level grammar, a smallest 2-level grammar and so on.

1.5 Outline of the Paper

In Section 2, we give basic definitions, we define the smallest grammar problem, we illustrate it with several examples and also illustrate in detail the connections between grammar-based compression and the related macro schemes by Storer and Szymanski [9]. The next section contains the hardness results mentioned above, where the 1-level and the multi-level cases are treated separately in Sections 3.1 and 3.2, respectively (in Section 3.3, we define and discuss possible extensions of the hardness reductions). The second main part of the paper is Section 4, where we show that the smallest grammar problem can be solved in polynomial time, if the number of nonterminals is bounded (in Section 4.1, we discuss some related questions). In the last part, Section 5, we first present a (simple) exact exponential-time algorithm for the 1-level case and then, in Section 5.2, we define the dynamic programming algorithm for the multi-level case. Finally, in Section 6, we summarise our results, point out open problems and mention further research tasks.

2 Preliminaries

In this section, we first introduce some general mathematical definitions and terminology about strings, and some basic concepts from graph theory and complexity theory. Then we define grammars and the smallest grammar problem and illustrate it by several examples. We conclude this section by a discussion of Storer and Szymanski’s external pointer macro scheme already mentioned in Section 1.

Let $\mathbb {N} = \{1, 2, 3, \ldots \}$ denote the natural numbers. By |A|, we denote the cardinality of a set A. Let Σ be a finite alphabet of symbols. A word or string (over Σ) is a sequence of symbols from Σ. For any word w over Σ, |w| denotes the length of w and ε denotes the empty word, i. e., |ε| = 0. The symbol Σ⁺ denotes the set of all non-empty words over Σ and Σ^∗ = Σ⁺ ∪{ε}. For the concatenation of two words w₁, w₂ we write w₁ ⋅ w₂ or simply w₁w₂. For every symbol a ∈Σ, we denote by |w|_a the number of occurrences of symbol a in w. We say that a word v ∈Σ^∗ is a factor of a word w ∈Σ^∗ if there are $u_{1}, u_{2} \in {\Sigma }^{*}$ such that w = u₁vu₂. If u₁ = ε (or u₂ = ε), then v is a prefix (or a suffix, respectively) of w. Furthermore, F(w) = {u∣u is a factor of w} and $F_{\geq 2}(w) = \{u \mid u \in F(w), |u| \geq 2\}$. For a position j, 1 ≤ j ≤|w|, we refer to the symbol at position j of w by the expression w[j] and $w[j..j^{\prime }] = w[j] w[j + 1] {\ldots } w[j^{\prime }]$, $j \leq j^{\prime } \leq |{w}|$. By w^R, we denote the reversal of w, i. e., w^R = w[n]w[n − 1]…w[1], where |w| = n.

A factorisation of a word w is a tuple (u₁, u₂,…, u_k) with u_i≠ε, 1 ≤ i ≤ k, such that w = u₁u₂…u_k.
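The notions of factor and factorisation can be made concrete with a minimal Python sketch. The function names `factors`, `factors_ge2` and `is_factorisation` are illustrative choices and not notation from the paper; words are modelled as Python strings over single-character symbols.

```python
def factors(w: str) -> set[str]:
    """F(w): every v with w = u1 v u2 (the empty word is a factor as well)."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def factors_ge2(w: str) -> set[str]:
    """F_{>=2}(w): the factors of w of length at least 2."""
    return {u for u in factors(w) if len(u) >= 2}


def is_factorisation(parts: tuple[str, ...], w: str) -> bool:
    """True if all parts are non-empty and their concatenation equals w."""
    return all(parts) and "".join(parts) == w


example_word = "aba"  # illustrative word, not taken from the paper
```

For the illustrative word `aba`, `factors_ge2` yields the three factors `ab`, `ba` and `aba`, and `("ab", "a")` is one of its factorisations.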

2.1 Basic Concepts of Graph Theory and Complexity Theory

We use undirected graphs, which are represented as pairs (V, E), where V is the set of vertices and E is the set of edges. For the sake of convenience, we write edges {u, v}∈ E also as (u, v) or (v, u). For a vertex v ∈ V, N(v) = {u∣(v, u) ∈ E} is the (open) neighbourhood (of v), N[v] = N(v) ∪{v} is the closed neighbourhood (of v) and, furthermore, we extend the notation of closed neighbourhood to sets $C \subseteq V$ in the obvious way, i. e., $N[C] = \bigcup _{v \in C}N[v]$. A graph is cubic (or subcubic) if, for every v ∈ V, |N(v)| = 3 (or |N(v)| ≤ 3, respectively).

A set $C \subseteq V$ is

an independent set if, for every u, v ∈ C, (u, v)∉E,
a dominating set if N[C] = V,
an independent dominating set if it is both an independent and a dominating set,
a vertex cover if, for every (u, v) ∈ E, {u, v}∩ C≠∅.

We are concerned with the corresponding problems of deciding, for a given graph G and a $k \in \mathbb {N}$, whether there is a vertex cover (or an independent dominating set) of cardinality at most k. It is a well-known fact that these decision problems are NP-complete problems (see [42]).
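The four set properties defined above translate directly into predicates. The following is a minimal sketch, assuming a graph given as a vertex list and a list of edge pairs; all function names and the small path graph are illustrative and do not come from the paper.

```python
def closed_neighbourhood(edges, c):
    """N[C]: the set C together with all vertices adjacent to some vertex of C."""
    c = set(c)
    result = set(c)
    for u, v in edges:
        if u in c:
            result.add(v)
        if v in c:
            result.add(u)
    return result


def is_independent(edges, c):
    """No edge has both endpoints in C."""
    c = set(c)
    return not any(u in c and v in c for u, v in edges)


def is_dominating(vertices, edges, c):
    """N[C] = V."""
    return closed_neighbourhood(edges, c) == set(vertices)


def is_vertex_cover(edges, c):
    """Every edge has at least one endpoint in C."""
    c = set(c)
    return all(u in c or v in c for u, v in edges)


# Illustrative graph: the path 1 - 2 - 3 - 4.
path_vertices = [1, 2, 3, 4]
path_edges = [(1, 2), (2, 3), (3, 4)]
```

On this path, {1, 4} is an independent dominating set but not a vertex cover, whereas {2, 3} is a vertex cover but not independent.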

For $k \in \mathbb {N}$, a graph G = (V, E), with |V | = n, is a k-interval graph, if there are intervals I_{i, j}, 1 ≤ i ≤|V |, 1 ≤ j ≤ k, on the real line, such that G is isomorphic to $(\{v_{i} \mid 1 \leq i \leq |V|\}, \{(v_{i}, v_{i^{\prime }}) \mid \bigcup ^{k}_{j = 1} I_{i, j} \cap \bigcup ^{k}_{j = 1} I_{i^{\prime }, j} \neq \emptyset \})$. For 1-interval graphs (which are also just called interval graphs), it is possible to compute minimal independent dominating sets in linear time (see [43]; note that a perfect elimination ordering (that is part of the input of Farber’s algorithm) can be easily computed in our applications, because the intervals are clear).

We assume the reader to be familiar with the basic concepts of complexity theory (for unexplained notions, see Papadimitriou [44]) and the theory of NP-completeness (see [44] and [42]).

As usual, for our running-time estimations, we mainly use the $\mathcal {O}$-notation, but sometimes also the $\mathcal {O}^{*}$-notation (ignoring polynomial factors). The latter is appropriate, if we are dealing with exponential-time algorithms (see Section 5).

Since we also wish to discuss some of our results from the parameterised complexity point of view, we shall briefly mention the concepts relevant for us (for detailed explanations on parameterised complexity, the reader is referred to the textbooks [45,46,47]). A parameterised problem is a decision problem with instances (x, k), where x is the actual input and $k \in \mathbb {N}$ is the parameter. By XP, we denote the class of parameterised problems that are solvable in time $\mathcal {O}(n^{f(k)})$ (where n is the size of the instance) and FPT denotes the class of fixed-parameter tractable problems, i. e., problems having an algorithm with running-time $\mathcal {O}(g(k) \cdot f(n))$, for a computable function g and polynomial f.

In order to argue about fixed-parameter intractability, we need the following kind of reductions. A (classical) many-one reduction R from a parameterised problem to another is an fpt-reduction, if the parameter of the target problem is bounded in terms of the parameter of the source problem, i. e., there is a recursive function $h\colon \mathbb {N} \rightarrow \mathbb {N}$ such that $R(x, k) = (x^{\prime }, k^{\prime })$ implies $k^{\prime } \leq h(k)$.

We shall use two different kinds of fixed-parameter intractability. First, if a parameterised problem is NP-hard even when the parameter is fixed to a constant, then it is not in FPT, unless P = NP. As a slightly weaker form of fixed-parameter intractability, the framework of parameterised complexity provides the classes of the so-called W-hierarchy, for which the hard problems (with respect to fpt-reductions) are considered fixed-parameter intractable, i. e., they are not in FPT (under some complexity theoretical assumptions). For a detailed definition of the W-hierarchy, we refer to the textbooks [45,46,47]; in this paper, we only use the first level of this hierarchy, i. e., the class W[1], and our respective intractability results are W[1]-hardness results.

A minimisation problem^{Footnote 7} P is a triple (I, S, m) with I being the set of instances, S being a function that maps instances x ∈ I to the set of feasible solutions for x, and m being the objective function that maps pairs (x, y) with x ∈ I and y ∈ S(x) to a positive rational number. For every x ∈ I, we denote $m^{*}(x):=\min \limits \{m(x,y)\colon y\in S(x)\}$. For two minimisation problems P₁, P₂ with P_j given by (I_j, S_j, m_j), j ∈{1,2}, an L-reduction from P₁ to P₂ is a quadruple (f, g, β, γ) such that

f is a polynomial-time computable function from I₁ to I₂ that satisfies, for every x ∈ I₁ with S₁(x)≠∅, S₂(f(x))≠∅.
g is a polynomial-time computable function that, for every x ∈ I₁ and y ∈ S₂(f(x)), maps (x, y) to a solution in S₁(x).
β is a constant such that $m_{2}^{*}(f(x))\leq \beta \cdot m_{1}^{*}(x)$ for each x ∈ I₁.
γ is a constant such that $m_{1}(x,g(x,y))-m_{1}^{*}(x)\leq \gamma \cdot (m_{2}(f(x),y)-m_{2}^{*}(f(x)))$ for each x ∈ I₁ and y ∈ S₂(f(x)).

We shall use L-reductions in order to show hardness for APX, the class of optimisation problems for which there exists an approximation algorithm with a constant approximation ratio. Note that, unless P = NP, an APX-hard problem does not have a polynomial-time approximation scheme (see [48] for detailed information of approximation hardness).
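As a short worked step (a standard consequence of the definition, given here under the assumptions r ≥ 1 and γ ≥ 0, where r is a symbol introduced only for this illustration), suppose that y ∈ S₂(f(x)) satisfies $m_{2}(f(x),y) \leq r \cdot m_{2}^{*}(f(x))$. Chaining the conditions on γ and β gives

$$ \begin{array}{@{}rcl@{}} m_{1}(x,g(x,y))-m_{1}^{*}(x) &\leq& \gamma \cdot (m_{2}(f(x),y)-m_{2}^{*}(f(x))) \\ &\leq& \gamma \cdot (r-1) \cdot m_{2}^{*}(f(x)) \\ &\leq& \beta \cdot \gamma \cdot (r-1) \cdot m_{1}^{*}(x) , \end{array} $$

so g(x, y) is a solution for x whose ratio is at most 1 + βγ(r − 1). In particular, ratios arbitrarily close to 1 for P₂ would yield ratios arbitrarily close to 1 for P₁, which is how an L-reduction transfers the impossibility of a polynomial-time approximation scheme.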

2.2 Grammars

A context-free grammar is a tuple G = (N,Σ, R, S), where N is the set of nonterminals, Σ is the terminal alphabet, S ∈ N is the start symbol and $R \subseteq N \times (N \cup {\Sigma })^{+}$ is the set of rules (as a convention, we write rules (A, w) ∈ R also in the form A → w). A context-free grammar G = (N,Σ, R, S) is a singleton grammar if R is a total function N → (N ∪Σ)⁺ and the relation {(A, B)∣(A, w) ∈ R,|w|_B ≥ 1} is acyclic.

For a singleton grammar G = (N,Σ, R, S), let D_G: (N ∪Σ) → (N ∪Σ)⁺ be defined by D_G(A) = R(A), A ∈ N, and D_G(a) = a, a ∈Σ. We extend D_G to a morphism (N ∪Σ)⁺ → (N ∪Σ)⁺ by setting D_G(α₁α₂…α_n) = D_G(α₁)D_G(α₂)…D_G(α_n), for α_i ∈ (N ∪Σ), 1 ≤ i ≤ n. Furthermore, for every α ∈ (N ∪Σ)⁺, we set ${\mathsf {D}^{1}_{G}}(\alpha ) = \mathsf {D}_{G}(\alpha )$, ${\mathsf {D}^{k}_{G}}(\alpha ) = \mathsf {D}_{G}(\mathsf {D}^{k-1}_{G}(\alpha ))$, for every k ≥ 2, and $\mathfrak {D}_{G}(\alpha ) = \lim _{k \to \infty } {\mathsf {D}^{k}_{G}}(\alpha )$ is the derivative of α. By definition, $\mathfrak {D}_{G}(\alpha )$ exists for every α ∈ (N ∪Σ)⁺ and is an element from Σ⁺. The size of the singleton grammar G is defined by $|G| = {\sum }_{A \in N} |\mathsf {D}_{G}(A)|$ and the rule-size of G is defined by |G|_r = |G| + |N| or, equivalently, $|G|_{\mathsf {r}} = {\sum }_{A \in N} (|\mathsf {D}_{G}(A)| + 1)$. Our main size measure will be |⋅|. The rule-size |⋅|_r will play a role in Section 3.3 and will be discussed in more detail there.

Remark 1

The class of singleton grammars exactly coincides with the class of context-free grammars that do not have unreachable rules (i. e., rules that cannot occur in any derivation) and that can derive exactly one word. As mentioned before, such grammars are also called straight-line programs in the literature. A context-free grammar that can derive only a single word and is not a singleton grammar must contain some rules that are not reachable. Since unreachable rules can easily be discovered and removed, we directly add this restriction to the concept of singleton grammars.

The derivation tree of G is a ranked ordered tree with node-labels from Σ ∪ N, inductively defined as follows. The root is labelled by S and every node labelled by A ∈ N with D(A) = α₁α₂…α_n has n children labelled by α₁, α₂,…, α_n in exactly this order; note that this means that all leaves are from Σ.

From now on, we simply use the term grammar instead of singleton grammar and if the grammar under consideration is clear from the context, we also drop the subscript G. We set $\mathfrak {D}(G) = \mathfrak {D}(S)$ and say that G is a grammar for $\mathfrak {D}(G)$. Since for singleton grammars, the start symbol is somewhat superfluous, we will ignore it and denote grammars G = (N,Σ, R, S) in the form G = (N,Σ, R, ax) instead, where ax = R(S) is called the axiom (of G). In particular, we interpret derivations to start directly with the axiom and, correspondingly, we also sometimes ignore the root of derivation trees. However, this does not change the size measures |⋅| and |⋅|_r, which, when ignoring the start symbol, can also be defined as $|G| = ({\sum }_{A \in N} |\mathsf {D}_{G}(A)|) + |\mathsf {ax}|$ and $|G|_{\mathsf {r}} = ({\sum }_{A \in N} (|\mathsf {D}_{G}(A)| + 1)) + |\mathsf {ax}| + 1$.

The number of levels of a grammar G = (N,Σ, R, ax) is $\min \limits \{k \mid {\mathsf {D}^{k}_{G}}(\mathsf {ax}) = \mathfrak {D}_{G}(\mathsf {ax})\}$, and a grammar with d levels is a d-level grammar. Intuitively speaking, a grammar G is a d-level grammar if we need exactly d derivation steps in order to derive $\mathfrak {D}(G)$ from the axiom; thus, the number of levels measures what we called in the introduction the “hierarchical depth” of a grammar. Note that for a d-level grammar, the derivation tree has a maximum depth of d + 1 and d + 2 levels (when counting the root as well). With this definition, the grammars that are the most restricted with respect to their hierarchical depth and that are still reasonable, are 1-level grammars (i. e., an axiom that derives a word in one step).
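The number of levels can be computed by applying the morphism D repeatedly to the axiom until no nonterminal remains. The sketch below assumes a grammar stored as a Python dict from single-character nonterminals to right-hand sides; the names `step` and `levels` and the two small grammars are hypothetical illustrations, not taken from the paper.

```python
def step(rules: dict[str, str], alpha: str) -> str:
    """One application of D: replace each nonterminal by its right-hand side."""
    return "".join(rules.get(symbol, symbol) for symbol in alpha)


def levels(rules: dict[str, str], axiom: str) -> int:
    """Smallest k >= 1 such that D^k(axiom) contains terminals only."""
    k, current = 1, step(rules, axiom)
    while any(symbol in rules for symbol in current):
        current = step(rules, current)
        k += 1
    return k


# Hypothetical grammars for illustration (uppercase letters are nonterminals).
two_level_rules = {"A": "BB", "B": "ab"}   # axiom "AB" needs two steps
one_level_rules = {"B": "ab"}              # axiom "BBB" needs one step
```

The loop terminates because the rules of a singleton grammar are acyclic; on a cyclic rule set it would not.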

Let G = (N,Σ, R, ax) be a 1-level grammar. The profit of a rule (A, α) ∈ R is defined by p(A) = |ax|_A(|α|− 1) −|α|. Intuitively speaking, if all occurrences of A in ax are replaced by α and the rule A → α is deleted, then the size of the grammar increases by exactly p(A). Consequently, $|G| = |\mathfrak {D}(G)| - {\sum }_{A \in N} \mathsf {p}(A)$.

Example 1

The grammar G = (N,Σ, R, ax) with N = {A, B}, $\Sigma = \{{\mathtt{a}}, {\mathtt{b}}\}$, $\mathsf{ax} = A {\mathtt{b}} {\mathtt{a}} A B {\mathtt{b}}$ and

$$ R = \{A \to B {\mathtt{a}} B, B \to {\mathtt{b}} {\mathtt{a}} {\mathtt{a}} {\mathtt{b}}\} $$

is a 2-level grammar of size 13 (and rule-size 16). Furthermore, $\mathfrak {D}(B) = {\mathtt {b}} {\mathtt {a}} {\mathtt {a}} {\mathtt {b}}$, $\mathfrak {D}(A) = \mathfrak {D}(B) {\mathtt {a}} \mathfrak {D}(B) = {\mathtt {b}} {\mathtt {a}} {\mathtt {a}} {\mathtt {b}} {\mathtt {a}} {\mathtt {b}} {\mathtt {a}} {\mathtt {a}} {\mathtt {b}}$ and

$$ \mathfrak{D}(G) = \mathfrak{D}(S) = \underbrace{{\mathtt{b}} {\mathtt{a}} {\mathtt{a}} {\mathtt{b}} {\mathtt{a}} {\mathtt{b}} {\mathtt{a}} {\mathtt{a}} {\mathtt{b}}}_{\mathfrak{D}(A)} {\mathtt{b}} {\mathtt{a}} \underbrace{{\mathtt{b}} {\mathtt{a}} {\mathtt{a}} {\mathtt{b}} {\mathtt{a}} {\mathtt{b}} {\mathtt{a}} {\mathtt{a}} {\mathtt{b}}}_{\mathfrak{D}(A)} \underbrace{{\mathtt{b}} {\mathtt{a}} {\mathtt{a}} {\mathtt{b}}}_{\mathfrak{D}(B)} {\mathtt{b}} . $$

Consequently, G is a size 13 representation of a word of length 25. A derivation tree of G can be seen in Fig. 1.

Replacing the axiom by $R(A) {\mathtt{b}} {\mathtt{a}} R(A) B {\mathtt{b}} = B {\mathtt{a}} B {\mathtt{b}} {\mathtt{a}} B {\mathtt{a}} B B {\mathtt{b}}$ and deleting rule $A \to B {\mathtt{a}} B$ turns G into a 1-level grammar $G^{\prime }$ with $\mathfrak {D}(G^{\prime }) = \mathfrak {D}(G)$. Moreover, p(B) = |ax|_B(|R(B)|− 1) −|R(B)| = 5(4 − 1) − 4 = 11 and $|G^{\prime }| = |\mathfrak {D}(G^{\prime })| - \mathsf {p}(B) = 25 - 11 = 14$.
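The numbers in this example can be rechecked mechanically. The following sketch assumes that the axiom is A ba A B b, as read off from the displayed derivative 𝔇(G), and that A → B a B; the helper names (`derive`, `size`, `rule_size`, `profit`) are illustrative only.

```python
def derive(rules: dict[str, str], alpha: str) -> str:
    """Expand alpha until only terminals remain (rules are assumed acyclic)."""
    while any(symbol in rules for symbol in alpha):
        alpha = "".join(rules.get(symbol, symbol) for symbol in alpha)
    return alpha


def size(rules: dict[str, str], axiom: str) -> int:
    """|G|: length of the axiom plus the lengths of all right-hand sides."""
    return len(axiom) + sum(len(rhs) for rhs in rules.values())


def rule_size(rules: dict[str, str], axiom: str) -> int:
    """|G|_r: |G| plus one per rule, counting the axiom as a rule."""
    return size(rules, axiom) + len(rules) + 1


def profit(rules: dict[str, str], axiom: str, nonterminal: str) -> int:
    """p(A) = |ax|_A (|alpha| - 1) - |alpha| for a rule A -> alpha."""
    rhs = rules[nonterminal]
    return axiom.count(nonterminal) * (len(rhs) - 1) - len(rhs)


# The 2-level grammar G of Example 1 (axiom reconstructed as stated above).
rules_g = {"A": "BaB", "B": "baab"}
axiom_g = "AbaABb"
word = derive(rules_g, axiom_g)

# The 1-level grammar G': substitute R(A) for A in the axiom and drop rule A.
rules_g1 = {"B": "baab"}
axiom_g1 = axiom_g.replace("A", rules_g["A"])
```

With these definitions, `size(rules_g, axiom_g)` is 13, `rule_size(rules_g, axiom_g)` is 16, `len(word)` is 25, `profit(rules_g1, axiom_g1, "B")` is 11 and `size(rules_g1, axiom_g1)` is 14, matching the values stated in the example.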

A smallest grammar for a word w is any grammar G with $\mathfrak {D}(G) = w$ and $|G| \leq |G^{\prime }|$ for every grammar $G^{\prime }$ with $\mathfrak {D}(G^{\prime }) = w$; generally, a grammar G is smallest if it is a smallest grammar for $\mathfrak {D}(G)$ (grammars that are smallest with respect to the rule-size measure will be called r-smallest grammars). The decision problem variant of computing smallest grammars is defined as follows:

Smallest Grammar Problem (SGP)
Instance: A word w and a $k \in \mathbb {N}$.
Question: Does there exist a grammar G with $\mathfrak {D}(G) = w$ and |G|≤ k?

The Smallest 1-Level Grammar Problem (1-SGP) is defined analogously, with the only difference that we ask for a 1-level grammar of size at most k. By SGP_r and 1-SGP_r, we denote the problem variants, where we consider the rule-size instead of the size, i. e., we require |G|_r ≤ k.

The optimisation variant of SGP, i. e., the task of actually producing a smallest grammar for a given word w, shall be denoted by SGP_opt (and SGP_{r, opt} if we are concerned with the rule-size). More precisely, according to the definitions given in Section 2.1, SGP_opt = (I, S, m), where I = Σ^∗, $S(w) = \{G \mid \mathfrak {D}(G) = w\}$ and m(w, G) = |G| (or m(w, G) = |G|_r for SGP_{r, opt}).

2.3 Examples

While the following examples illustrate the smallest grammar problem in general, they are particularly tailored to the technicalities to be encountered in Section 3, i. e., they shall point out the difficulties arising in predicting how factors in a larger word are compressed by a smallest grammar, which is crucial in the design of gadgets for a hardness reduction.

Let $w = {\prod }^{n}_{i = 1} 1 0^{i}$ be a word over the binary alphabet Σ = {0,1}, where n = 2^k, $k \in \mathbb {N}$. This word has a very simple structure and can be interpreted as a list of a (potentially unbounded) number of integers. This is crucial, since if we want to encode objects (e. g., graphs), the size of which is not bounded in terms of the alphabet size, then structures of this form will inevitably appear.

One way of compressing w that comes to mind is by the use of rules A₁ → 10, A_i → A_{i−1}0, 2 ≤ i ≤ n − 1, and an axiom A₁A₂…A_{n−1}A_{n−1}0, which leads to the grammar G₁ = (N,Σ, R, ax), with:

$$ \begin{array}{@{}rcl@{}} N &=&\{A_{i} \mid 1 \leq i \leq n-1\} ,\\ R &=&\{A_{1} \to 1 0\} \cup \{A_{i} \to A_{i - 1}0 \mid 2 \leq i \leq n-1\} ,\\ \mathsf{ax} &=& A_{1} A_{2} {\ldots} A_{n - 1} A_{n - 1} 0 . \end{array} $$

This grammar has an overall size given by $|G_{1}| = \underbrace {n + 1}_{\mathsf {ax}} + \underbrace {2(n - 1)}_{\text {rules}} = 3n - 1$.

However, it is also possible to construct the factors 0ⁱ, 1 ≤ i ≤ n, “from the middle” by rules A₁ → 010, A_i → 0A_{i−1}0, $2 \leq i \leq \frac {n}{2} - 1$, and an axiom 1(A₁)²(A₂)²… By using these ideas, we can construct the smaller grammar G₂ = (N,Σ, R, ax), where

$$ \begin{array}{@{}rcl@{}} N &= &\{A_{i} \mid 1 \leq i \leq \tfrac{n}{2} - 1\} \cup \{B_{i} \mid 1 \leq i \leq k - 2\} ,\\ R &= &\{A_{1} \to 0 1 0, B_{1} \to 0 0\} \cup \{A_{i} \to 0 A_{i - 1}0 \mid 2 \leq i \leq \tfrac{n}{2} - 1\} \cup \\ &&\{B_{i} \to B_{i - 1} B_{i - 1} \mid 2 \leq i \leq k - 2\} ,\\ \mathsf{ax} &= &1 (A_{1})^{2} (A_{2})^{2} {\ldots} (A_{\frac{n}{2} - 1})^{2} 0 A_{\frac{n}{2} - 1}0 B_{k-2} B_{k-2} . \end{array} $$

We have $|G_{2}| = \underbrace {n + 4}_{\mathsf {ax}} + \underbrace {3(\tfrac {n}{2} - 1) + 2(k-2)}_{\text {rules}} = \frac {5n}{2} + 2k - 3$.
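Both constructions can be tested on small instances. The sketch below builds G₁ and G₂ for n = 2^k, expands them and compares their sizes with the closed forms 3n − 1 and 5n/2 + 2k − 3. Nonterminals are modelled as tuples such as `("A", i)`; this representation and all function names are illustrative assumptions, and k ≥ 3 is assumed so that B_{k−2} exists.

```python
def derive(symbols, rules):
    """Expand a sequence of symbols into the terminal word it derives."""
    return "".join(derive(rules[s], rules) if s in rules else s for s in symbols)


def size(axiom, rules):
    return len(axiom) + sum(len(rhs) for rhs in rules.values())


def target(n):
    """The word w = prod_{i=1..n} 1 0^i."""
    return "".join("1" + "0" * i for i in range(1, n + 1))


def g1(n):
    rules = {("A", 1): ["1", "0"]}
    for i in range(2, n):
        rules[("A", i)] = [("A", i - 1), "0"]
    axiom = [("A", i) for i in range(1, n)] + [("A", n - 1), "0"]
    return axiom, rules


def g2(n, k):
    h = n // 2
    rules = {("A", 1): ["0", "1", "0"], ("B", 1): ["0", "0"]}
    for i in range(2, h):
        rules[("A", i)] = ["0", ("A", i - 1), "0"]
    for i in range(2, k - 1):
        rules[("B", i)] = [("B", i - 1), ("B", i - 1)]
    axiom = ["1"]
    for i in range(1, h):
        axiom += [("A", i), ("A", i)]
    axiom += ["0", ("A", h - 1), "0", ("B", k - 2), ("B", k - 2)]
    return axiom, rules


k = 4
n = 2 ** k
for name, (axiom, rules) in (("G1", g1(n)), ("G2", g2(n, k))):
    assert derive(axiom, rules) == target(n)
    print(name, size(axiom, rules))
```

For k = 4 the two sizes are 47 and 45, in agreement with the formulas above.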

Both of these grammars achieve an asymptotic compression rate of order $\mathcal {O}(\sqrt {|{w}|})$, but, generally, grammars are capable of exponential compression rates (see [33, 34]). Aiming for such exponential compression, it seems worthwhile to represent every unary factor $0^{2^{\ell }}$, 1 ≤ ℓ ≤ k, by a nonterminal B_ℓ (obviously, this requires only k rules of size 2) and then represent all unary factors by sums of these powers (e. g., 0⁷⁴ is compressed by B₁B₃B₆). Formally, consider G₃ = (N,Σ, R, ax), where

$$ \begin{array}{@{}rcl@{}} N &= &\{B_{i} \mid 1 \leq i \leq k - 1\} ,\\ R &= &\{B_{1} \to 0 0\} \cup \{B_{i} \to B_{i - 1} B_{i - 1} \mid 2 \leq i \leq k - 1\} ,\\ \mathsf{ax} &= &\left( \prod\limits_{i = 1}^{n-1} 1 \alpha_{i}\right) (B_{k-1})^{2} , \end{array} $$

where α_i = x₀x₁…x_{k−1} and, for every j, 1 ≤ j ≤ k − 1, x_j = B_j if the j-th bit (i. e., the one representing 2^j) of the binary representation of i is 1 and x_j = ε otherwise. However, this yields a grammar of size

$$ |G_{3}| = \underbrace{\tfrac{1}{2}(n-1)k}_{\mathsf{ax}} + \underbrace{2(k-1)}_{\text{rules}} = \frac{k(n + 3)}{2} - 2 , $$

which, if k is sufficiently large, is worse than the previous grammars.
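To see where G₃ falls behind, one can simply evaluate the three size formulas stated so far. The sketch below does this for n = 2^k; it only evaluates the closed forms given in the text and does not construct the grammars, and the function names are illustrative.

```python
from fractions import Fraction


def size_g1(n):
    return 3 * n - 1


def size_g2(n, k):
    return Fraction(5 * n, 2) + 2 * k - 3


def size_g3(n, k):
    return Fraction(k * (n + 3), 2) - 2


for k in range(3, 11):
    n = 2 ** k
    s1, s2, s3 = size_g1(n), size_g2(n, k), size_g3(n, k)
    # Last column: is G3 larger than both G1 and G2 for this k?
    print(k, n, s1, float(s2), float(s3), s3 > max(s1, s2))
```

With these formulas, G₃ is larger than both G₁ and G₂ from k = 6 onwards, while for k = 5 it is still slightly smaller than G₂.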

A grammar that is even smaller than G₂ can be obtained by combining the idea of G₂ with that of representing factors $0^{2^{\ell }}$ by nonterminals B_ℓ. More precisely, for every ℓ, 1 ≤ ℓ ≤ k − 2, we represent $0^{2^{\ell }}$ by an individual nonterminal B_ℓ and, in addition, we use rules A₁ → 010, A_i → 0A_{i−1}0, $2 \leq i \leq \frac {n}{4}$. Then the left and right half of w can be compressed in the way of G₂, with the only difference that in the right part, for every unary factor, we also need an occurrence of B_{k−1}, i. e., consider G₄ = (N,Σ, R, ax) with:

$$ \begin{array}{@{}rcl@{}} N &= &\{A_{i} \mid 1 \leq i \leq \tfrac{n}{4}\} \cup \{B_{i} \mid 1 \leq i \leq k-1\} ,\\ R &= &\{A_{1} \to 0 1 0, B_{1} \to 0 0\} \cup \{A_{i} \to 0 A_{i - 1} 0 \mid 2 \leq i \leq \tfrac{n}{4}\} \cup \\ &&\{B_{i} \to B_{i - 1} B_{i - 1} \mid 2 \leq i \leq k-1\} ,\\ \mathsf{ax} &= &1 (A_{1})^{2} (A_{2})^{2} {\ldots} (A_{\frac{n}{4}})^{2} B_{k-2} \\ &&(A_{1} B_{k-1})^{2} (A_{2} B_{k-1})^{2} {\ldots} (A_{\frac{n}{4} - 1} B_{k-1})^{2} A_{\frac{n}{4}} B_{k-1} B_{k-2} . \end{array} $$

This grammar yields a size of $|G_{4}| = \underbrace {\tfrac {3n}{2} + 1}_{\mathsf {ax}} + \underbrace {\tfrac {3n}{4} + 2(k-1)}_{\text {rules}} = \frac {9n}{4} + 2k - 1$. Note that again the asymptotic compression rate is of order $\mathcal {O}(\sqrt {|{w}|})$.
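The construction of G₄ can be checked in the same way. The following sketch builds G₄ for n = 2^k, verifies that it derives w and compares its size with 9n/4 + 2k − 1. The tuple representation of nonterminals and the function names are illustrative assumptions, and k ≥ 3 is assumed.

```python
def derive(symbols, rules):
    """Expand a sequence of symbols into the terminal word it derives."""
    return "".join(derive(rules[s], rules) if s in rules else s for s in symbols)


def size(axiom, rules):
    return len(axiom) + sum(len(rhs) for rhs in rules.values())


def target(n):
    """The word w = prod_{i=1..n} 1 0^i."""
    return "".join("1" + "0" * i for i in range(1, n + 1))


def g4(n, k):
    q = n // 4
    rules = {("A", 1): ["0", "1", "0"], ("B", 1): ["0", "0"]}
    for i in range(2, q + 1):
        rules[("A", i)] = ["0", ("A", i - 1), "0"]
    for i in range(2, k):
        rules[("B", i)] = [("B", i - 1), ("B", i - 1)]
    big, mid = ("B", k - 1), ("B", k - 2)   # derive 0^(n/2) and 0^(n/4)
    axiom = ["1"]
    for i in range(1, q + 1):               # left half, as in G2
        axiom += [("A", i), ("A", i)]
    axiom.append(mid)
    for i in range(1, q):                   # right half, padded with B_{k-1}
        axiom += [("A", i), big, ("A", i), big]
    axiom += [("A", q), big, mid]
    return axiom, rules


k = 4
n = 2 ** k
axiom, rules = g4(n, k)
assert derive(axiom, rules) == target(n)
print(size(axiom, rules), 9 * n // 4 + 2 * k - 1)
```

For k = 4 both printed values are 43.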

These considerations point out that even for simply structured words like w, it is very difficult to determine the structure of a smallest grammar or its size. However, for reducing an NP-hard problem, we need to know, to at least some extent, how smallest grammars compress the constructed strings in order to relate the reduced instances to the original instances. Consequently, the above examples point out the challenges that arise in this regard.

We conclude this list of examples by pointing out that giving a smallest grammar for our toy example $w = {\prod }^{n}_{i = 1} 1 0^{i}$ as a function of n is essentially an open problem. A respective asymptotic bound of ${\Omega }(\sqrt {|w|})$ is a reasonable assumption, but we have no proof for this claim.

2.4 Storer and Szymanski’s External Pointer Macro Scheme and Grammar-Based Compression

Storer and Szymanski [9] introduce a very general form of a compression scheme that covers a large variety of different compression strategies, in particular also grammar-based compression. On the one hand, we cite their work as the first that, in a sense, considered grammar-based compression, but in the context of our paper, it is also of greater importance for the following reasons. The technical report [10]^{Footnote 8} provides a comprehensive complexity analysis of many different variants of Storer and Szymanski’s compression scheme with many NP-hardness reductions. Some of the considered variants also concern the case of fixed alphabets, which has led to the misunderstanding that the hardness of the smallest grammar problem for fixed alphabets is provided by [10], leading to the misconception that also in practical scenarios – i. e., for fixed alphabets – grammar-based compression is known to be intractable. Since closing this gap by providing the assumed hardness result is one of the main objectives of this paper, we shall discuss in some more detail why it cannot already be found among the many hardness results of [10].

First, we recall the definitions of Storer and Szymanski [9] that are relevant here. For a word w ∈Σ⁺ and a pointer size $p \in \mathbb {N}$, a compressed form of w for pointer size p using the external pointer macro, EPM for short, is any word s₀#s₁ with $s_{0}, s_{1} \in ({\Sigma } \cup \{1, 2, \ldots , |s_{0}|\}^{2})^{+}$, #∉Σ, and w can be obtained from s₀#s₁ by repeating the following two steps:

Replace every symbol (i, j) in s₁ by s₀[i..j],
repeat the first step until s₁ equals w.

The size of an EPM s₀#s₁ is defined by ${\sum }_{i = 1}^{|s_{0} s_{1}|} \ell _{i}$, where ℓ_i = 1, if s₀s₁[i] ∈Σ and ℓ_i = p, otherwise (i. e., each occurrence of a symbol from $\{1, 2, \ldots , |s_{0}|\}^{2}$ (the actual pointers) contributes the pointer size p to the overall size of the EPM).

A grammar for a word w easily translates into an EPM for w. For example, the grammar G = (N,Σ, R, ax) with N = {A, B}, Σ = {a, b, c}, R = {A → BcB, B → ba} and ax = AabBBAc translates into the external pointer macro ba(1,2)c(1,2)#(3,5)ab(1,2)(1,2)(3,5)c. More precisely, the prefix ba is the right side of the rule for B, (1,2)c(1,2) corresponds to the right side of the rule for A, where the occurrences of B are represented by pointers (1,2) to the prefix s₀[1..2] = ba, and (3,5)ab(1,2)(1,2)(3,5)c corresponds to the axiom, where occurrences of A and B are represented by pointers (3,5) and (1,2), respectively. If the pointer size is 1, then the EPM has the same size as the grammar.

If an EPM s₀#s₁ is non-overlapping, i. e., it is never the case that for two pointers (i, j) and (k, ℓ) we have i ≤ k ≤ j or k ≤ i ≤ ℓ, then it also translates into a grammar by transforming each pointer (i, j) into a nonterminal A_{(i, j)} with a rule A_{(i, j)} → s₀[i..j]. In this regard, it is important to note that the property of an EPM that s₁ can be turned into w by repeated replacement of the pointers ensures that the derivation function of the grammar constructed in this way is acyclic.
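The converse translation can be sketched along the same lines. The code below checks the non-overlapping condition for distinct pointers and turns every pointer (i, j) into a nonterminal with right side s₀[i..j]. The list encoding of the EPM, the tuple names `("A", i, j)` for nonterminals and the reading of the condition as applying to distinct pointers are assumptions made for this illustration.

```python
def pointers(s0, s1):
    """All distinct pointers (i, j) occurring in s0 or s1, sorted by start."""
    return sorted({x for x in s0 + s1 if isinstance(x, tuple)})


def is_non_overlapping(s0, s1):
    """Each distinct pointer must end before the next one begins."""
    ps = pointers(s0, s1)
    return all(j < k for (a, j), (k, b) in zip(ps, ps[1:]))


def to_symbol(x):
    """Pointer (i, j) becomes the nonterminal ('A', i, j); terminals are kept."""
    return ("A",) + x if isinstance(x, tuple) else x


def epm_to_grammar(s0, s1):
    rules = {to_symbol(p): [to_symbol(x) for x in s0[p[0] - 1:p[1]]]
             for p in pointers(s0, s1)}
    axiom = [to_symbol(x) for x in s1]
    return axiom, rules


def derive(symbols, rules):
    return "".join(derive(rules[s], rules) if s in rules else s for s in symbols)


s0 = ["b", "a", (1, 2), "c", (1, 2)]
s1 = [(3, 5), "a", "b", (1, 2), (1, 2), (3, 5), "c"]

assert is_non_overlapping(s0, s1)
axiom, rules = epm_to_grammar(s0, s1)
size = len(axiom) + sum(len(rhs) for rhs in rules.values())
print(derive(axiom, rules), size)
```

Applied to the EPM of the previous example, this recovers a grammar of size 12 for the same word, up to renaming of the nonterminals.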

We conclude that the concept of singleton grammars and the concept of EPMs with pointer size 1 and without overlapping are more or less identical, i. e., they just differ syntactically. Consequently, the problem of grammar-based compression and the problem of computing smallest EPMs with pointer size 1 and without overlapping are identical problems.

However, a closer look at Storer [10] shows that in this paper the variant of computing EPMs with pointer size 1 is not considered. Instead, the focus is on EPMs (and other kinds of compression schemes), for which the pointer size is not even constant, but a function of the length of the word that is compressed, typically logarithmic in the size |w|. Note that this avoids the main difficulties encountered when designing a reduction for grammar-based compression with fixed alphabets (see Section 3): the factors that encode vertices of a graph must have unbounded length, which makes it rather difficult to control how the grammar compresses these codewords. On the other hand, if the pointers (which correspond to nonterminals in the grammar) have size $\log (|{w}|)$, then it does not make sense to compress factors that are smaller than this size (since we gain nothing by replacing them by pointers). It is straightforward to represent a graph as a word of length linear in the size of the graph, where the length of the factors (i. e., the codewords) that represent single vertices are logarithmic in the size of the graph (this is the case in all reductions of [9, 33, 34]). The property mentioned above, i. e., that factors of logarithmic size are not compressed, then simply means that we can assume that the codewords for vertices are not compressed in the string that describes the graph, which makes it rather simple to devise a hardness reduction (in fact, controlling the possible compression of codewords is the main technical challenge in our reductions).

3 NP-Hardness of Computing Smallest Grammars for Fixed Alphabets

In their basic structure, the hardness reductions to be presented next are similar to the one from [33, 34], which shows NP-hardness of SGP for unbounded alphabets by a reduction from the vertex cover problem. All the effort of this section will consist in the extension of the general idea to the case of a fixed alphabet. In order to facilitate the accessibility of our technical proofs, we shall sketch this reduction from [33, 34].

Let $\mathcal {G} = (V, E)$ be a graph with

$$ V=\{v_{1},\dots,v_{n}\} \text{ and } E=\{(v_{j_{2i-1}},v_{j_{2i}})\mid 1 \leq i \leq m\} . $$

We define the following word over the alphabet V ∪{◇_i∣1 ≤ i ≤ 5n + m}∪{#} (for the sake of simplicity, every individual occurrence of ◇ in the word stands for a distinct symbol of {◇_i∣1 ≤ i ≤ 5n + m}):

$$ \begin{array}{@{}rcl@{}} w_{\mathcal{G}} = \prod\limits_{i = 1}^{n}(\#v_{i} \diamond v_{i}\# \diamond)^{2}\prod\limits_{i = 1}^{n}(\# v_{i} \# \diamond) \prod\limits_{i = 1}^{m}(\#v_{j_{2i-1}}\#v_{j_{2i}}\#\diamond) . \end{array} $$

Let G = (N,Σ, R, S) be a smallest grammar for $w_{\mathcal {G}}$, then we can observe the following:

For every A ∈ N, $\mathfrak {D}(A) \in \{\# v_{i}, v_{i} \#, \# v_{i} \# \mid 1 \leq i \leq n\}$. This is due to the fact that the only factors of $w_{\mathcal {G}}$ with repetitions are of the form #v_i, v_i# or #v_i#.
We can assume that, for every i, 1 ≤ i ≤ n, there are rules A_i → #v_i and B_i → v_i#, since if some of these rules are missing, then adding them and compressing the respective factors does not increase the size of the grammar.
Let $\mathfrak {I} \subseteq \{1, 2, \ldots , n\}$ contain exactly the indices i such that a rule with derivative #v_i# exists; moreover, we can assume that all these rules have the form C_i → A_i#.
Let ${\Gamma } = \{v_{i} \mid i \in \mathfrak {I}\}$. If an edge $(v_{j_{2i-1}}, v_{j_{2i}})$ is not covered by Γ, then adding a rule $C_{j_{2i-1}} \to A_{j_{2i-1}} \#$ or $C_{j_{2i}} \to A_{j_{2i}} \#$ does not increase the size of the grammar. So we can assume that Γ is a vertex cover.

These observations show that there exists a grammar G for $w_{\mathcal {G}}$ with |G|≤ 15n + 3m + k if and only if there is a vertex cover for $\mathcal {G}$ of size at most k (for a formal proof, we refer to [33, 34]).
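The direction from a vertex cover to a grammar can be illustrated on a small graph. The sketch below builds $w_{\mathcal {G}}$ with pairwise distinct separator symbols, constructs the grammar with rules A_i → #v_i, B_i → v_i# and C_i → A_i# for the cover vertices, and checks that its size is 15n + 3m + k. It does not show that this grammar is smallest; the symbol encoding and the function names are illustrative assumptions.

```python
from itertools import count


def build_word(n, edges):
    fresh = count(1)

    def sep():                       # every separator is a distinct symbol
        return ("sep", next(fresh))

    w = []
    for i in range(1, n + 1):
        for _ in range(2):
            w += ["#", ("v", i), sep(), ("v", i), "#", sep()]
    for i in range(1, n + 1):
        w += ["#", ("v", i), "#", sep()]
    for a, b in edges:
        w += ["#", ("v", a), "#", ("v", b), "#", sep()]
    return w


def grammar_from_cover(n, edges, cover):
    fresh = count(1)

    def sep():
        return ("sep", next(fresh))

    rules = {}
    for i in range(1, n + 1):
        rules[("A", i)] = ["#", ("v", i)]
        rules[("B", i)] = [("v", i), "#"]
    for i in cover:
        rules[("C", i)] = [("A", i), "#"]
    ax = []
    for i in range(1, n + 1):
        for _ in range(2):
            ax += [("A", i), sep(), ("B", i), sep()]
    for i in range(1, n + 1):
        ax += ([("C", i)] if i in cover else [("A", i), "#"]) + [sep()]
    for a, b in edges:               # every edge has an endpoint in the cover
        ax += ([("C", a), ("B", b)] if a in cover else [("A", a), ("C", b)]) + [sep()]
    return ax, rules


def derive(symbols, rules):
    out = []
    for s in symbols:
        out += derive(rules[s], rules) if s in rules else [s]
    return out


n, edges, cover = 3, [(1, 2), (2, 3)], {2}   # path on three vertices
ax, rules = grammar_from_cover(n, edges, cover)
size = len(ax) + sum(len(rhs) for rhs in rules.values())
assert derive(ax, rules) == build_word(n, edges)
print(size, 15 * n + 3 * len(edges) + len(cover))
```

For the path on three vertices with cover {v₂}, both printed values are 52.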

A simple modification of this reduction yields the following.

Theorem 1

1-SGP is NP-complete.

Proof

We slightly change the reduction from [33, 34] as follows:

$$ \begin{array}{@{}rcl@{}} w_{\mathcal{G}} = \prod\limits_{i = 1}^{n}(\#v_{i}\diamond v_{i}\# \diamond)^{2}\prod\limits_{i = 1}^{n}(\#v_{i}\#\diamond)^{2} \prod\limits_{i = 1}^{m}(\#v_{j_{2i-1}}\#v_{j_{2i}}\#\diamond) . \end{array} $$

The only difference from the original reduction is that the size of the rules with derivative #v_i# has increased by 1, i. e., they now have the form C_i → #v_i#, so by repeating the factors #v_i# ◇, we make sure that adding such a rule whenever an edge is not covered does not increase the size of the grammar. □

In these reductions, we encode the different vertices of a graph by single symbols and also use individual separator symbols (i. e., symbols with only one occurrence in the word to be compressed). This makes it particularly easy to devise suitable gadgets, but, on the other hand, it assumes that we have an arbitrarily large alphabet at our disposal. In the remainder of this section, we shall extend these hardness results to the more realistic case of fixed alphabets. The general structure of our reductions is similar to the ones of [10, 33, 34] sketched above, but, due to the constraint of having a fixed alphabet, they substantially differ on a more detailed level. More precisely, since fixed alphabets make it impossible to use single symbols (or even words of constant size) as separators or as representatives for vertices, we need to use special encodings for which we are able to determine how a smallest grammar will compress them (in this regard, recall our examples from Section 2.3 demonstrating how difficult it can be to determine a smallest grammar even for a single simply structured word). This constitutes a substantial technical challenge, which complicates our reductions considerably.

In the following, we prove that 1-SGP and SGP are NP-hard, even for constant alphabets of size 5 and 24, respectively. The stronger result claimed in the abstract and introduction, i. e., the hardness of SGP for alphabets of size 17, is presented later as an improvement (see Section 3.4, Corollary 1).

3.1 The 1-Level Case

As a tool for proving the hardness of 1-SGP, but also as a result in its own right, we first show that the compression of any 1-level grammar is at best quadratic (in contrast to general grammars, which can achieve exponential compression). Note that the bound of Lemma 1 is tight, e. g., consider $\mathtt {a}^{n^{2}}$ and a grammar with rules S → Aⁿ and A → aⁿ.

Lemma 1

Let G be a 1-level grammar. Then $|G| \geq 2 \sqrt {|\mathfrak {D}(G)|}$.

Proof

Let $n = |\mathfrak {D}(G)|$, let ax be the axiom and let A → u be a rule with a right side of maximum length. Obviously, |ax||u|≥ n, and, since $x+y\geq 2\sqrt {xy}$ holds for all x, y ≥ 0, also $|\mathsf {ax}| + |u| \geq 2 \sqrt {|\mathsf {ax}||u|}$. Consequently,

$$ |G| \geq |\mathsf{ax}| + |u| \geq 2 \sqrt{|\mathsf{ax}||u|} \geq 2\sqrt{n} . $$

□
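The tightness of this bound is easy to confirm numerically. The following sketch builds the 1-level grammar with axiom Aⁿ and rule A → aⁿ for the word a^{n²} and checks that its size equals 2√(n²) = 2n; the function names are illustrative.

```python
from math import isqrt


def one_level_size(axiom, rules):
    """Size of a 1-level grammar: |ax| plus the lengths of all right sides."""
    return len(axiom) + sum(len(rhs) for rhs in rules.values())


def square_grammar(n):
    """Axiom A^n and rule A -> a^n, deriving the word a^(n*n)."""
    return "A" * n, {"A": "a" * n}


for n in (1, 2, 5, 10):
    axiom, rules = square_grammar(n)
    word = "".join(rules.get(s, s) for s in axiom)
    assert word == "a" * (n * n)
    # The lower bound 2 * sqrt(|D(G)|) of Lemma 1 is met with equality.
    assert one_level_size(axiom, rules) == 2 * isqrt(len(word))
    print(n, len(word), one_level_size(axiom, rules))
```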

In order to prove the NP-hardness of 1-SGP for constant alphabets, we also devise a reduction from the vertex cover problem. To this end, let $\mathcal {G} = (V, E)$ be the graph defined above and, without loss of generality, we assume n ≥ 40. We define Σ = {a, b, ◇, ⋆, #} and $[\diamond ] = \diamond ^{n^{3}}$. For each i, 1 ≤ i ≤ n, we encode v_i by a word $\overline {v_{i}} \in \{\mathtt {a},\mathtt {b}\}^{\lceil \log (n)\rceil }$ such that $\overline {v_{i}} \neq \overline {v_{j}}$ if and only if i≠j (e. g., by taking $\overline {v_{i}}$ to be the binary representation of i over symbols a and b and with $\lceil \log (n)\rceil $ many digits). We now define the following word over Σ:

$$ \begin{array}{@{}rcl@{}} w &= &\prod\limits_{i=1}^{n}(\# \overline{v_{i}} [\diamond] \overline{v_{i}} \# [\diamond])^{2\left\lceil \log(n) \right\rceil+3} \prod\limits_{i=1}^{n}(\# \overline{v_{i}} \# [\diamond])^{\left\lceil \log(n) \right\rceil +1} \\ &&\prod\limits_{i = 1}^{m}(\# \overline{v_{j_{2i-1}}} \# \overline{v_{j_{2i}}} \# [\diamond])^{2} \star [\diamond]^{n^{3}} . \end{array} $$

First, we show how a vertex cover for $\mathcal {G}$ translates into a grammar for w:

Lemma 2

If there exists a size k vertex cover of $\mathcal {G}$, then there exists a 1-level grammar G with $\mathfrak {D}(G) = w$ and $|G| = 13n\left \lceil \log (n) \right \rceil + 17n + k + 6m + 1 + 2n^{3}$.

Proof

Let ${\Gamma } \subseteq V$ be a size-k vertex cover of $\mathcal {G}$. We define a grammar G = (N,Σ, R, ax) with

$$ \begin{array}{@{}rcl@{}} N &= &\{D, \overset{{~}_{\leftarrow}}{V_{i}}, {\!}_{\rightarrow}{V_{i}}, \overset{{~}_{\leftrightarrow}}{V_{j}} \mid 1 \leq i \leq n, v_{j} \in {\Gamma}\} ,\\ R &= &\{S \to u, D \to [\diamond]\} \cup \{\overset{{~}_{\leftarrow}}{V_{i}} \to \# \overline{v_{i}}, {\!}_{\rightarrow}{V_{i}} \to \overline{v_{i}} \# \mid 1 \leq i \leq n\} \cup \\ &&\{\overset{{~}_{\leftrightarrow}}{V_{j}} \to \# \overline{v_{j}} \# \mid v_{j} \in {\Gamma}\} ,\\ \mathsf{ax} &= &\prod\limits_{i=1}^{n}(\overset{{~}_{\leftarrow}}{V_{i}} D {\!}_{\rightarrow}{V_{i}} D)^{2\left\lceil \log(n) \right\rceil+3} \prod\limits_{i=1}^{n}(y_{i} D)^{\left\lceil \log(n) \right\rceil +1} \prod\limits_{i = 1}^{m}(z_{i} D)^{2} \star D^{n^{3}} , \end{array} $$

where, for every i, 1 ≤ i ≤ n, $y_{i} = \overset {{~}_{\leftrightarrow }}{V_{i}}$ if v_i ∈Γ and $y_{i} = \overset {{~}_{\leftarrow }}{V_{i}} \#$ otherwise, and, for every i, 1 ≤ i ≤ m, $z_{i} = \overset {{~}_{\leftrightarrow }}{V}_{j_{2i-1}} {\!}_{\rightarrow }{V}_{j_{2i}}$ if $v_{j_{2i-1}} \in {\Gamma }$ and $z_{i} = \overset {{~}_{\leftarrow }}{V}_{j_{2i-1}} \overset {{~}_{\leftrightarrow }}{V}_{j_{2i}}$ if $v_{j_{2i-1}} \notin {\Gamma }$ (note that in this case $v_{j_{2i}} \in {\Gamma }$).

Obviously, G is a 1-level grammar and it can be easily verified that $\mathfrak {D}(G) = w$. It remains to determine the size of G. To this end, we first observe that each rule $\overset {{~}_{\leftarrow }}{V_{i}} \to \# \overline {v_{i}}$ and ${\!}_{\rightarrow }{V_{i}} \to \overline {v_{i}} \#$, 1 ≤ i ≤ n, has size of $\lceil \log (n)\rceil + 1$, each rule $\overset {{~}_{\leftrightarrow }}{V_{j}} \to \# \overline {v_{j}} \#$, v_j ∈Γ, has size of $\lceil \log (n)\rceil + 2$, and the rule D → [◇] has size of n³. Hence, the size contributed by these rules is

$$ 2n\lceil\log(n)\rceil + 2n + k\lceil\log(n)\rceil+ 2k + n^{3} . $$

The axiom has size of

$$ \begin{array}{@{}rcl@{}} &&4n(2\left\lceil \log(n) \right\rceil+3) + (3n - k)(\left\lceil \log(n) \right\rceil+1) + 6m + 1 + n^{3}\\ &= &11n\left\lceil \log(n) \right\rceil - k \left\lceil \log(n) \right\rceil + 15n - k + 6m + 1 + n^{3} . \end{array} $$

So the total size is

$$ 13n\left\lceil \log(n) \right\rceil + 17n + k + 6m + 1 + 2n^{3} . $$

□
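The construction of Lemma 2 can be replayed on a small graph. The sketch below builds w and the grammar G from a vertex cover and checks both $\mathfrak {D}(G) = w$ and the size formula. Several choices are assumptions made for this illustration: the assumption n ≥ 40 is dropped (it is not needed for this direction, and n = 5 keeps w short), the logarithm is taken to base 2, the codeword of v_i is the binary representation of i − 1 over a and b, and the nonterminal names `("L", i)`, `("R", i)`, `("LR", i)` and `("D",)` are hypothetical stand-ins for the arrow-decorated nonterminals and D.

```python
def ceil_log2(n):
    return (n - 1).bit_length()


def code(i, n):
    """Codeword of v_i: binary representation of i - 1 with ceil(log2 n) digits."""
    bits = format(i - 1, "b").zfill(ceil_log2(n))
    return bits.replace("0", "a").replace("1", "b")


def build_w(n, edges):
    L, box = ceil_log2(n), "◇" * n ** 3
    w = ""
    for i in range(1, n + 1):
        w += ("#" + code(i, n) + box + code(i, n) + "#" + box) * (2 * L + 3)
    for i in range(1, n + 1):
        w += ("#" + code(i, n) + "#" + box) * (L + 1)
    for a, b in edges:
        w += ("#" + code(a, n) + "#" + code(b, n) + "#" + box) * 2
    return w + "⋆" + box * n ** 3


def grammar_from_cover(n, edges, cover):
    L, D = ceil_log2(n), ("D",)
    rules = {D: list("◇" * n ** 3)}
    for i in range(1, n + 1):
        rules[("L", i)] = list("#" + code(i, n))
        rules[("R", i)] = list(code(i, n) + "#")
    for j in cover:
        rules[("LR", j)] = list("#" + code(j, n) + "#")
    ax = []
    for i in range(1, n + 1):
        ax += [("L", i), D, ("R", i), D] * (2 * L + 3)
    for i in range(1, n + 1):
        y = [("LR", i)] if i in cover else [("L", i), "#"]
        ax += (y + [D]) * (L + 1)
    for a, b in edges:               # every edge has an endpoint in the cover
        z = [("LR", a), ("R", b)] if a in cover else [("L", a), ("LR", b)]
        ax += (z + [D]) * 2
    return ax + ["⋆"] + [D] * n ** 3, rules


def derive(axiom, rules):
    return "".join("".join(rules[s]) if s in rules else s for s in axiom)


n, edges, cover = 5, [(1, 2), (2, 3), (3, 4), (4, 5)], {2, 4}
ax, rules = grammar_from_cover(n, edges, cover)
size = len(ax) + sum(len(rhs) for rhs in rules.values())
L, k, m = ceil_log2(n), len(cover), len(edges)
assert derive(ax, rules) == build_w(n, edges)
print(size, 13 * n * L + 17 * n + k + 6 * m + 1 + 2 * n ** 3)
```

For the path on five vertices with cover {v₂, v₄}, both printed values are 557.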

Next, we take care of the opposite direction, i. e., we show how a vertex cover can be extracted from a grammar for w:

Lemma 3

If there exists a 1-level grammar G with $\mathfrak {D}(G) = w$ and $|G| \leq 13n\left \lceil \log (n) \right \rceil + 17n + k + 6m + 1 + 2n^{3}$, then there exists a size k vertex cover of $\mathcal {G}$.

Proof

Let G = (N,Σ, R, ax) be a smallest 1-level grammar with

$$ |G| \leq 13n\left\lceil \log(n) \right\rceil + 17n + k + 6m + 1 + 2n^{3} $$

and $\mathfrak {D}(G) = w$. We first observe that, since n ≥ 40,

$$ \begin{array}{@{}rcl@{}} 13n\left\lceil \log(n) \right\rceil + 17n + k + 6m + 1 < 19n^{2} + 18n < 20n^{2} = \frac{40}{2}n^{2} \leq \frac{n}{2}n^{2} = \frac{n^{3}}{2} . \end{array} $$

Thus, $|G| < \frac {n^{3}}{2} + 2n^{3} = \frac {5n^{3}}{2}$. Due to the separator symbol ⋆ with only one occurrence in w, we know that the axiom of G has the form $u \star u^{\prime }$. Hence, we can consider all the nonterminals (and their rules) that occur in $u^{\prime }$ as an individual 1-level grammar $G^{\prime }$ for the word $\mathfrak {D}(u^{\prime }) = [\diamond ]^{n^{3}}$ of size n⁶. By Lemma 1, we can conclude that $|G^{\prime }| \geq 2n^{3}$; thus, $2n^{3} \leq |G| < \frac {5n^{3}}{2}$.

Claim 1: There is a D ∈ N with D → [◇] and, for every other rule A → x in R, |x|_◇ = 0.

Proof of Claim 1: First, we assume that there is a rule A →◇^ℓ with ℓ > n³. This rule can only be used in order to compress the suffix $[\diamond ]^{n^{3}}$ of w, since the other part of w has no occurrence of a factor ◇^ℓ. Hence, we can replace A →◇^ℓ by the rule $A \to \diamond ^{n^{3}}$ and change the axiom to $u \star A^{n^{3}}$. By Lemma 1, the rule $A \to \diamond ^{n^{3}}$ with axiom $A^{n^{3}}$ compresses the subword $[\diamond ]^{n^{3}}$ optimally, which means that this operation does not increase the size of G. Therefore, we can assume that G does not contain a rule A →◇^ℓ with ℓ > n³.

Since w contains at least n³ non-overlapping occurrences of the factor [◇] and since |G| < 3n³, at least one of these factors must be produced by at most 2 nonterminals. This implies that there is a rule B → v with $|v| \geq \frac {|[\diamond ]|}{2} = \frac {n^{3}}{2}$. If v contains a symbol from Σ ∖{◇}, then B → v is not a rule of $G^{\prime }$; thus, by Lemma 1, it follows that $|G| \geq |G^{\prime }| + \frac {n^{3}}{2} \geq 2n^{3} + \frac {n^{3}}{2} = \frac {5n^{3}}{2}$, which is a contradiction. Hence, we can conclude that v ∈{◇}^∗ and we further assume that, among all rules with a right side in {◇}^∗ of size at least $\frac {n^{3}}{2}$, B → v is such that |v| is maximal. Moreover, let |v| = n³ − t, for a $t \in \mathbb {N}$.

We note that, due to the maximality of B → v and the fact that all rules in $G^{\prime }$ have a right side in {◇}^∗, a rule of maximum size in $G^{\prime }$ has size at most n³ − t. In particular, this implies

$$ |u^{\prime}| \geq \frac{n^{6}}{n^{3} - t} > \frac{n^{6} - t^{2}}{n^{3} - t} = \frac{(n^{3} + t)(n^{3} - t)}{n^{3} - t} = n^{3} + t , $$

where $u^{\prime }$ is the right side of the axiom as defined above.

We now remove rule B → v, add the rule D → [◇] and replace part $u^{\prime }$ of the axiom by $D^{n^{3}}$. Since |[◇]| = |v| + t and $|u^{\prime }| \geq n^{3} + t = |D^{n^{3}}| + t$, this does not increase the size of the grammar. However, the rule B → v might have been used in order to produce some of the factors [◇] in the left part u of the axiom of G; thus, since we removed the rule B → v, we have to repair G accordingly.

To this end, we first note that every occurrence of [◇] to the left of ⋆ in w is compressed by a sequence E₁C₁C₂…C_pE₂ of terminals or nonterminals, such that $\mathfrak {D}(E_{1} C_{1} C_{2} {\ldots } C_{p} E_{2}) = x [\diamond ] y$, where E₁ → x ◇^q, q ≥ 1, or E₁ = ε, and E₂ →◇^ry, r ≥ 1, or E₂ = ε. For every such occurrence of [◇] to the left of ⋆ in w, we exchange E₁C₁C₂…C_pE₂ by $E^{\prime }_{1} D E^{\prime }_{2}$, where $E^{\prime }_{1} = \varepsilon $, if E₁ = ε and $E^{\prime }_{1} = x$ if E₁ → x ◇^q, q ≥ 1, and $E^{\prime }_{2} = \varepsilon $, if E₂ = ε and $E^{\prime }_{2} = y$ if E₂ →◇^ry, r ≥ 1. This construction removes rules or shortens them; thus, in order to conclude that the overall size of the grammar does not increase, we only have to observe that the size of the axiom is not increased. To this end, we first observe that if p = 0, then E₁ or E₂ must have a right side of length at least $\frac {n^{3}}{2}$ that contains a symbol from Σ ∖{◇}, but, as shown above, such rules do not exist. Hence, we can assume that p ≥ 1. Furthermore, since E₁ = ε implies $E^{\prime }_{1} = \varepsilon $ and E₂ = ε implies $E^{\prime }_{2} = \varepsilon $, $|E_{1} C_{1} C_{2} {\ldots } C_{p} E_{2}| \geq |E^{\prime }_{1} D E^{\prime }_{2}|$ follows.

We conclude that the overall size of the grammar did not increase due to these modifications. Moreover, G now contains a rule D → [◇] and, since all occurrences of ◇ in w are produced by this rule, we can safely remove all other rules that produce an occurrence of ◇ from the grammar. (Claim 1) $\square $

The statement of the previous claim particularly implies that the axiom of G has the form

$$ \mathsf{ax} = \prod\limits_{i=1}^{n}(\alpha_{i} D \alpha^{\prime}_{i} D)^{2\left\lceil \log(n) \right\rceil+3} \prod\limits_{i=1}^{n}(\beta_{i} D)^{\left\lceil \log(n) \right\rceil +1} \prod\limits_{i = 1}^{m}(\gamma_{i} D)^{2} \star D^{n^{3}} , $$

where $\alpha _{i}, \alpha ^{\prime }_{i}, \beta _{i}, \gamma _{j} \in (N \cup {\Sigma })^{*}$, 1 ≤ i ≤ n, 1 ≤ j ≤ m.

Claim 2: For every i, 1 ≤ i ≤ n, $\alpha _{i} = \overset {{~}_{\leftarrow }}{V_{i}}$, $\alpha ^{\prime }_{i} = {\!}_{\rightarrow }{V_{i}}$, where $\overset {{~}_{\leftarrow }}{V_{i}}, {\!}_{\rightarrow }{V_{i}}$ are nonterminals with rules $\overset {{~}_{\leftarrow }}{V_{i}} \rightarrow \# \overline {v_{i}}$ and ${\!}_{\rightarrow }{V_{i}} \rightarrow \overline {v_{i}} \#$.

Proof of Claim 2: Obviously, for every i, 1 ≤ i ≤ n, $\mathfrak {D}(\alpha _{i}) = \# \overline {v_{i}}$, which means that |α_i| = 1 implies that α_i is a nonterminal with derivative $\# \overline {v_{i}}$. We now assume that |α_i|≥ 2 for some i, 1 ≤ i ≤ n. If we substitute α_i by a new nonterminal $\overset {{~}_{\leftarrow }}{V_{i}}$ with a rule $\overset {{~}_{\leftarrow }}{V_{i}} \rightarrow \# \overline {v_{i}}$, then we shorten the axiom by at least $2\lceil \log (n)\rceil +3$ and the size of the new rule is $|\# \overline {v_{i}}| = \left \lceil \log (n) \right \rceil + 1$; thus, the overall size of the grammar does not increase. An analogous argument applies if $|\alpha ^{\prime }_{i}| \geq 2$ for some i, 1 ≤ i ≤ n. Consequently, we can assume that we have $\overset {{~}_{\leftarrow }}{V_{i}}, {\!}_{\rightarrow }{V_{i}} \in N$ with rules $ \overset {{~}_{\leftarrow }}{V_{i}} \rightarrow \# \overline {v_{i}}$ and ${\!}_{\rightarrow }{V_{i}} \rightarrow \overline {v_{i}} \#$, and $\alpha _{i} = \overset {{~}_{\leftarrow }}{V_{i}}$, $\alpha ^{\prime }_{i} = {\!}_{\rightarrow }{V_{i}}$, 1 ≤ i ≤ n. (Claim 2) $\square $

We recall that, for every i, 1 ≤ i ≤ n, $\mathfrak {D}(\beta _{i}) = \# \overline {v_{i}} \#$. Hence, if, for some i, 1 ≤ i ≤ n, |β_i|≥ 2, then we can as well replace β_i by $\overset {{~}_{\leftarrow }}{V_{i}} \#$ without increasing the size of the grammar. This implies that, for every i, 1 ≤ i ≤ n, $\beta _{i} = \overset {{~}_{\leftarrow }}{V_{i}} \#$ or $\beta _{i} = \overset {{~}_{\leftrightarrow }}{V_{i}}$ with $\overset {{~}_{\leftrightarrow }}{V_{i}} \to \# \overline {v_{i}} \#$.

Next, recall that, for every i, 1 ≤ i ≤ m, $\mathfrak {D}(\gamma _{i}) = \# \overline {v_{j_{2i-1}}} \# \overline {v_{j_{2i}}} \#$. If, for some i, 1 ≤ i ≤ m, |γ_i|≥ 3, then we can as well replace γ_i by $\overset {{~}_{\leftarrow }}{V}_{j_{2i-1}} \overset {{~}_{\leftarrow }}{V}_{j_{2i}} \#$ without increasing the size of the grammar. If |γ_i| = 1, then there is a rule $E \to \# \overline {v_{j_{2i-1}}} \# \overline {v_{j_{2i}}} \#$ of size $2 \lceil \log (n)\rceil + 3$. If we now replace γ_i by $\overset {{~}_{\leftarrow }}{V}_{j_{2i-1}} \overset {{~}_{\leftarrow }}{V}_{j_{2i}} \#$, then we increase the size of the axiom (and therefore of the grammar) by 4. However, since there are no other occurrences of $\# \overline {v_{j_{2i-1}}} \# \overline {v_{j_{2i}}} \#$ in w, there are no other occurrences of E in the axiom; thus, we can remove the rule $E \to \# \overline {v_{j_{2i-1}}} \# \overline {v_{j_{2i}}} \#$, which decreases the size of the grammar by $2 \lceil \log (n)\rceil + 3 \geq 4$. Hence, the overall size of the grammar does not increase. If |γ_i| = 2, then γ_i = E₁E₂ with $E_{1} \to \# \overline {v_{j_{2i-1}}} \# x$ or $E_{2} \to x \# \overline {v_{j_{2i}}} \#$. Let us assume that there is a rule $E_{1} \to \# \overline {v_{j_{2i-1}}} \# x$ (the case $E_{2} \to x \# \overline {v_{j_{2i}}} \#$ is analogous). If we now change this rule to $E_{1} \to \# \overline {v_{j_{2i-1}}} \#$ and substitute every E₂ by ${\!}_{\rightarrow }{V}_{j_{2i}}$, then the size of the grammar does not increase (note that the nonterminals E₁ and E₂ can only occur in some γ_j, which has been replaced in this way).

These considerations demonstrate that we can assume that, in addition to the rule D → [◇], the rules of G are $\overset {{~}_{\leftarrow }}{V_{i}} \rightarrow \# \overline {v_{i}}$, ${\!}_{\rightarrow }{V_{i}} \to \overline {v_{i}}\#$, 1 ≤ i ≤ n, and rules $\overset {{~}_{\leftrightarrow }}{V_{i}} \to \#\overline {v_{i}}\#$ with $i \in \mathfrak {I}$, for some $\mathfrak {I} \subseteq \{1, 2, \ldots , n\}$. We now define $\ell = |\mathfrak {I}|$ and the vertex set $\mathcal {V} = \{v_{i} \mid i \in \mathfrak {I}\}$; furthermore, let t be the number of edges from $\mathcal {G}$ that are covered by some vertex of $\mathcal {V}$. The axiom has the following form:

$$ \mathsf{ax} = \prod\limits_{i=1}^{n}(\overset{{~}_{\leftarrow}}{V_{i}} D {\!}_{\rightarrow}{V_{i}} D)^{2\left\lceil \log(n) \right\rceil+3} \prod\limits_{i=1}^{n}(y_{i} D)^{\left\lceil \log(n) \right\rceil +1} \prod\limits_{i = 1}^{m}(z_{i} D)^{2} \star D^{n^{3}} , $$

where, for every i, 1 ≤ i ≤ n, $y_{i} = \overset {{~}_{\leftrightarrow }}{V_{i}}$ if $v_{i} \in \mathcal {V}$ and $y_{i} = \overset {{~}_{\leftarrow }}{V_{i}} \#$ otherwise, and, for every i, 1 ≤ i ≤ m, $z_{i} = \overset {{~}_{\leftarrow }}{V}_{j_{2i-1}} \overset {{~}_{\leftarrow }}{V}_{j_{2i}} \#$, if the edge $(v_{j_{2i-1}}, v_{j_{2i}})$ is not covered by $\mathcal {V}$, $z_{i} = \overset {{~}_{\leftrightarrow }}{V}_{j_{2i-1}} {\!}_{\rightarrow }{V}_{j_{2i}}$ or $z_{i} = \overset {{~}_{\leftarrow }}{V}_{j_{2i-1}} \overset {{~}_{\leftrightarrow }}{V}_{j_{2i}}$, if $v_{j_{2i-1}} \in \mathcal {V}$ or $v_{j_{2i}} \in \mathcal {V}$, respectively.

The total size of the rules is

$$ 2n\lceil \log(n) \rceil + 2n + \ell\lceil \log(n) \rceil + 2\ell + n^{3} . $$

Moreover,

$$ \begin{array}{@{}rcl@{}} |\mathsf{ax}|& = &4n(2\left\lceil \log(n) \right\rceil + 3) + (\left\lceil \log(n) \right\rceil + 1)(3n - \ell) + 6t + 8(m - t) + 1 + n^{3} \\ &= &11n\left\lceil \log(n) \right\rceil + 15n - \ell\left\lceil \log(n) \right\rceil - \ell + 8m - 2t + 1 + n^{3} . \end{array} $$

Consequently, $|G| = 13n\left \lceil \log (n) \right \rceil + 17n + \ell + 8m - 2t + 1 + 2n^{3}$. Since, by assumption, $|G| \leq 13n\left \lceil \log (n) \right \rceil + 17n + k + 6m + 1 + 2n^{3}$, we conclude that ℓ + 8m − 2t ≤ k + 6m. From this inequality, since t ≤ m, we can deduce ℓ ≤ k on the one hand and also $m - \frac {k-\ell }{2} \leq t$ on the other.
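The two consequences drawn from this inequality can be spelled out as a short worked step, which uses only t ≤ m:

$$ \ell + 8m - 2t \leq k + 6m \iff \ell + 2(m - t) \leq k \iff m - \frac{k-\ell}{2} \leq t . $$

Since m − t ≥ 0, the middle inequality also gives $\ell \leq \ell + 2(m - t) \leq k$.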

Consequently, the vertex set $\mathcal {V}$ covers already $m - \frac {k-\ell }{2}$ edges of $\mathcal {G}$. This implies that we can extend $\mathcal {V}$ to a vertex cover $\mathcal {V}^{\prime }$ for $\mathcal {G}$ by adding q vertices, where $q \leq \frac {k-\ell }{2} \leq k-\ell $. Since $|\mathcal {V}| = \ell $, $|\mathcal {V}^{\prime }| \leq |\mathcal {V}| + q \leq \ell + k-\ell = k$. □

From Lemmas 2 and 3, we can directly conclude the following theorem:

Theorem 2

1-SGP is NP-complete, even for |Σ| = 5.

3.2 The Multi-Level Case

In the above reduction for the 1-level case, the main difficulty is the use of unary factors as separators. However, once those separators are in place, we know the factors of w that are produced by nonterminals and, for a smallest 1-level grammar, this already fully determines the axiom and therefore also the grammar itself. For the multi-level case, the situation is much more complicated. Even if we manage to force the axiom to factorise w into parts that are either separators or codewords of vertices, this only determines the top-most level of the grammar and we do not necessarily know how these single factors are further hierarchically compressed and, more importantly, the dependencies between these compressions (i. e., how they share the same rules).

To deal with these issues, we rely on a larger alphabet Σ and we use palindromic codewords u ⋆ u^R, where ⋆ ∈Σ and u is a word over an alphabet of size 7 representing a 7-ary number. The purpose of the palindromic structure is twofold. Firstly, it implies that codewords always start and end with the same symbol, which, in the construction of w, makes it easier to avoid the situation that an overlapping between neighbouring codewords is repeated elsewhere in w (see Lemma 4). Secondly, if all codewords are produced by individual nonterminals, then we can show that they are produced best “from the middle”, similar to the rules of the example grammar G₂ from Section 2.3. In addition to this, we also need a vertex colouring and an edge colouring of certain variants of the graph to be encoded.

In order to formally define the reduction, we first give some preparatory definitions. Let

$$ {\Sigma}=\{x_{1},\dots,x_{7}, d_{1},\dots,d_{7}, \star,\#, {\cent}_{1},{\cent}_{2},\$_{1},\dots, \$_{6}\} $$

be an alphabet of size 24. The function $M \colon \mathbb {N}\times \mathbb {N}\rightarrow \mathbb {N}$ is defined by

$$ M(q,k):=\min\{r>0\mid \exists \ t\in \mathbb{N}\colon q=tk+r\} $$

(note that M is the positive modulo-function, i. e., M(q, k) = q%k, if q%k≠ 0 and M(q, k) = k, otherwise). Let the functions $f\colon \mathbb {N} \rightarrow \{x_{1},\dots ,x_{7}\}^{+}$ and $g\colon \mathbb {N} \rightarrow \{d_{1},\dots ,d_{7}\}^{+}$ be defined by

$$ \begin{array}{@{}rcl@{}} f(q) &:= &x_{a_{0}} x_{a_{1}}{\dots} x_{a_{k}} \text{ and}\\ g(q) &:= &d_{a_{0}} d_{a_{1}}{\dots} d_{a_{k}} , \end{array} $$

for every $q \in \mathbb {N}$, where $k \in \mathbb {N} \cup \{0\}$ and a_i ∈{1,2,…,7}, 0 ≤ i ≤ k, such that $q={\sum }^{k}_{i=0} a_{i} 7^{i}$ is satisfied. Note that since, for every $q \in \mathbb {N}$, there are unique $k \in \mathbb {N} \cup \{0\}$ and a_i ∈{1,2,…,7}, 0 ≤ i ≤ k, such that $q={\sum }^{k}_{i=0} a_{i} 7^{i}$, the functions f and g are well-defined.
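The definitions of M, f and g can be illustrated by a small sketch. It assumes that the symbols x_a and d_a are represented by the strings `"x1"`, …, `"x7"` and `"d1"`, …, `"d7"`; the helper `digits` is a hypothetical name for the digit sequence a_0, a_1, …, a_k (least significant digit first).

```python
def M(q: int, k: int) -> int:
    """Positive modulo: the remainder of q modulo k, taken from {1, ..., k}."""
    r = q % k
    return r if r != 0 else k

def digits(q: int) -> list:
    """Digits a_0, ..., a_k in {1, ..., 7} with q = sum(a_i * 7**i), for q >= 1."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    out = []
    while q > 0:
        a = M(q, 7)          # current digit, never 0
        out.append(a)
        q = (q - a) // 7     # remove the digit and shift
    return out

def f(q: int) -> list:
    """f(q) as a list of symbols over {x1, ..., x7}."""
    return [f"x{a}" for a in digits(q)]

def g(q: int) -> list:
    """g(q) as a list of symbols over {d1, ..., d7}."""
    return [f"d{a}" for a in digits(q)]
```

For instance, 8 = 1 + 1 · 7 gives `digits(8) == [1, 1]`, and 14 = 7 + 1 · 7 gives `digits(14) == [7, 1]`.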

For every $i \in \mathbb {N}$, let 〈i〉_v := f(i) ⋆ f(i)^R and 〈i〉_◇ := g(i) ⋆ g(i)^R. The factors 〈i〉_v and 〈i〉_◇ are called codewords; 〈i〉_v represents a vertex v_i, while the 〈i〉_◇ are used as separators.

Observation 1

The functions f and g are bijections and they are 7-ary representations of the integers n > 0 (least significant digit first). Thus, for any $n \in \mathbb {N} \cup \{0\}$, g(7n + i)[1] = d_i and f(7n + i)[1] = x_i, 1 ≤ i ≤ 7. In particular, this means that $\{g(n+i)[1]\mid 0\leq i \leq 6\}=\{d_{1},\dots ,d_{7}\}$ and $\{f(n+i)[1]\mid 0\leq i \leq 6\}=\{x_{1},\dots ,x_{7}\}$, for every $n \in \mathbb {N}$. Consequently, for every $n, n^{\prime } \in \mathbb {N}$ with $M(n, 7) \neq M(n^{\prime }, 7)$, the factors 〈n〉_v and $\langle n^{\prime } \rangle _{v}$ do not share any prefixes or suffixes (and the same holds for the words 〈n〉_◇).
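Observation 1 can be checked experimentally on small values. The sketch below assumes the same string representation of symbols as before (`"x1"`, `"d1"`, `"*"` for ⋆); `codeword`, `first_symbols` and `share_prefix_or_suffix` are hypothetical helper names.

```python
def M(q, k):
    """Positive modulo with values in {1, ..., k}."""
    r = q % k
    return r if r else k

def digits(q):
    """Digits in {1, ..., 7}, least significant first."""
    out = []
    while q > 0:
        a = M(q, 7)
        out.append(a)
        q = (q - a) // 7
    return out

def codeword(i, letter):
    """<i>_v for letter 'x', <i>_diamond for letter 'd': u * u^R as a symbol list."""
    half = [f"{letter}{a}" for a in digits(i)]
    return half + ["*"] + half[::-1]

def first_symbols(start, letter):
    """First symbols of the codewords for start, start + 1, ..., start + 6."""
    return {codeword(start + i, letter)[0] for i in range(7)}

def share_prefix_or_suffix(a, b):
    """True if two codewords have a common non-empty prefix or suffix."""
    return a[0] == b[0] or a[-1] == b[-1]
```

Since a codeword starts and ends with the symbol for its least significant digit, comparing first and last symbols suffices to detect a common prefix or suffix.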

Let $\mathcal {G}=(V,E)$ be a subcubic graph (i. e., a graph with maximum degree 3) with $V=\{v_{1},\dots ,v_{n}\}$ and $E=\{\{v_{j_{2i-1}},v_{j_{2i}}\}\mid 1 \leq i \leq m\}$ (note that the vertex cover problem remains NP-hard if restricted to subcubic graphs (see [49])). Let $\mathcal {G}^{\prime }=(V,E^{\prime })$ be the multi-graph defined by

$$ E^{\prime}:=\left\{\{v_{j_{2i}},v_{j_{2i+1}}\}\mid 1 \leq i \leq m-1\right\} . $$

By [50], it is possible to compute in polynomial time a proper edge-colouring (meaning a colouring such that no two edges which share one or two vertices have the same colour) for a multi-graph with at most $\lfloor \tfrac 32 {\Delta }\rfloor $ colours, where Δ is the maximum degree of the multi-graph. Since the graph $\mathcal {G}$ is subcubic, the maximum degree of $\mathcal {G}^{\prime }$ is at most three and we can compute a proper edge-colouring $C_{e}\colon E^{\prime }\rightarrow \{1,2,3,4\}$ for $\mathcal {G}^{\prime }$ with colours {1,2,3,4}. Let $\mathcal {G}^{2}=(V,E^{\prime \prime })$ be the graph defined by

$$ E^{\prime\prime}=\left\{\{u,v\}\mid \ \{u,w\},\{w,v\}\in E\text{ for some } w\in V\!\setminus\!\{u,v\}, u\not= v\right\} . $$

Since $\mathcal {G}$ is subcubic, $\mathcal {G}^{2}$ has maximum degree at most six. Let $C_{v}\colon \{1,\dots ,n\}\rightarrow \{1,2,3,4,5,6,7\}$ be a proper vertex-colouring (defined over the vertex-indices of $V=\{v_{1},\dots ,v_{n}\}$) for $\mathcal {G}^{2}$ with colours {1,2,3,4,5,6,7}. Such a colouring can be computed by an algorithmic version of Brooks’ theorem [51].
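The graph $\mathcal {G}^{2}$ and a colouring C_v can be sketched as follows. Note that this sketch does not implement the algorithmic version of Brooks’ theorem cited above; it uses plain greedy colouring instead, which already stays within seven colours whenever the maximum degree is at most six. Vertices are assumed to be the indices 1, …, n and edges are pairs of indices; both function names are hypothetical.

```python
from itertools import combinations

def square_graph(n, edges):
    """Edge set E'' : pairs of distinct vertices that have a common neighbour in G."""
    nbrs = {v: set() for v in range(1, n + 1)}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    e2 = set()
    for w in nbrs:
        # any two distinct neighbours of w are joined in G^2
        for u, v in combinations(sorted(nbrs[w]), 2):
            e2.add((u, v))
    return e2

def greedy_colouring(n, e2):
    """Greedy proper colouring with colours 1..7 (assumes maximum degree <= 6)."""
    adj = {v: set() for v in range(1, n + 1)}
    for u, v in e2:
        adj[u].add(v)
        adj[v].add(u)
    colour = {}
    for v in range(1, n + 1):
        used = {colour[u] for u in adj[v] if u in colour}
        colour[v] = min(c for c in range(1, 8) if c not in used)
    return colour
```

For the star with centre v₁ and leaves v₂, v₃, v₄, the three leaves are pairwise adjacent in $\mathcal {G}^{2}$ and therefore receive three different colours.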

Let $w_{\mathcal {G}} = u v w$ be the word representing $\mathcal {G}$, where u, v, w ∈Σ⁺ are defined as follows (note that $m \leq \frac {3n}{2}$, so 7m < 14n in the word w).

$$ u = \prod\limits_{j=0}^{6} \left( {\prod}_{i=1}^{14n} (\langle i \rangle_{\diamond} \langle M(i+j,14n) \rangle_{v})\right) \$_{1} $$

$$ \begin{array}{@{}rcl@{}} v &=& \prod\limits_{i=1}^{n} \left( \# \langle 7i+C_{v}(i) \rangle_{v} {\cent}_{1} \langle 7i-1 \rangle_{\diamond}\right) \$_{2} \prod\limits_{i=1}^{n} \left( \# \langle 7i+C_{v}(i) \rangle_{v} {\cent}_{2} \langle 7i-2 \rangle_{\diamond}\right) \$_{3}\\ &&\prod\limits_{i=1}^{n} \left( \langle 7i+C_{v}(i) \rangle_{v} \# \langle 7i-2 \rangle_{\diamond} {\cent}_{1}\right) \$_{4} \prod\limits_{i=1}^{n} \left( \langle 7i+C_{v}(i) \rangle_{v} \# \langle 7i-1 \rangle_{\diamond} {\cent}_{2}\right) \$_{5}\\ &&\prod\limits_{i=1}^{n} \left( \# \langle 7i+C_{v}(i) \rangle_{v} \# \langle 7i \rangle_{\diamond}\right) \$_{6} \end{array} $$

$$ \begin{array}{@{}rcl@{}} w = \prod\limits_{i=1}^{m-1} (\!\!\!\!\!&&\# \langle 7j_{2i-1}+C_{v}(j_{2i-1}) \rangle_{v} \# \langle 7j_{2i}+C_{v}(j_{2i}) \rangle_{v} \# \langle 7i+C_{e}(v_{j_{2i}},v_{j_{2i+1}}) \rangle_{\diamond} )\\ &&\# \langle 7j_{2m-1}+C_{v}(j_{2m-1}) \rangle_{v} \# \langle 7j_{2m}+C_{v}(j_{2m}) \rangle_{v} \# \end{array} $$
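The construction of $w_{\mathcal {G}} = u v w$ can be sketched as follows. This is an illustration under several assumptions: symbols are strings (`"*"` for ⋆, `"c1"`, `"c2"` for ¢₁, ¢₂, `"$1"`, …, `"$6"` for the dollar symbols), the colourings C_v and C_e are supplied by the caller and are assumed to be proper, and `Ce[i-1]` is the colour of the i-th edge of $\mathcal {G}^{\prime }$. All function names are hypothetical.

```python
def M(q, k):
    r = q % k
    return r if r else k

def digits(q):
    out = []
    while q > 0:
        a = M(q, 7)
        out.append(a)
        q = (q - a) // 7
    return out

def cw(i, letter):
    """Palindromic codeword: letter 'x' gives <i>_v, letter 'd' gives <i>_diamond."""
    half = [f"{letter}{a}" for a in digits(i)]
    return half + ["*"] + half[::-1]

def build_word(n, edges, Cv, Ce):
    """Return (u, v, w) as symbol lists; edges[i-1] = (j_{2i-1}, j_{2i})."""
    m = len(edges)
    assert len(Ce) == m - 1
    V = lambda i: cw(7 * i + Cv[i], "x")      # codeword of vertex v_i
    u = []
    for j in range(7):
        for i in range(1, 14 * n + 1):
            u += cw(i, "d") + cw(M(i + j, 14 * n), "x")
    u += ["$1"]
    v = []
    for i in range(1, n + 1):
        v += ["#"] + V(i) + ["c1"] + cw(7 * i - 1, "d")
    v += ["$2"]
    for i in range(1, n + 1):
        v += ["#"] + V(i) + ["c2"] + cw(7 * i - 2, "d")
    v += ["$3"]
    for i in range(1, n + 1):
        v += V(i) + ["#"] + cw(7 * i - 2, "d") + ["c1"]
    v += ["$4"]
    for i in range(1, n + 1):
        v += V(i) + ["#"] + cw(7 * i - 1, "d") + ["c2"]
    v += ["$5"]
    for i in range(1, n + 1):
        v += ["#"] + V(i) + ["#"] + cw(7 * i, "d")
    v += ["$6"]
    w = []
    for i in range(1, m):
        a, b = edges[i - 1]
        w += ["#"] + V(a) + ["#"] + V(b) + ["#"] + cw(7 * i + Ce[i - 1], "d")
    a, b = edges[m - 1]
    w += ["#"] + V(a) + ["#"] + V(b) + ["#"]
    return u, v, w
```

As a usage example, for the path v₁, v₂, v₃ with edge list `[(1, 2), (3, 2)]`, the colouring `Cv = {1: 1, 2: 1, 3: 2}` is proper for $\mathcal {G}^{2}$ and `Ce = [1]` colours the single edge of $\mathcal {G}^{\prime }$; the resulting u contains 7 · 14n · 2 = 196n occurrences of ⋆.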

This concludes the definition of the reduction. Since the following proof of correctness is very complicated, we first present a corresponding “road-map”, to make it more accessible:

First, and completely independent from the question of how a grammar could compress $w_{\mathcal {G}}$, we take a closer look at the structure of this word. More precisely, in Propositions 1 and 2, we show that if a factor of $w_{\mathcal {G}}$ spans over the symbol ⋆ of some codeword 〈i〉_v or 〈i〉_◇ and also reaches over the boundaries of this codeword into some other factor, then it is not repeated in $w_{\mathcal {G}}$. This property is the main reason for the complicated structure of $w_{\mathcal {G}}$ (especially the factor v).
An immediate consequence of the property described in the previous point is that in a smallest grammar, any nonterminal that derives a factor with an occurrence of ⋆ necessarily derives a factor that is completely contained in some codeword 〈i〉_◇ or in some codeword 〈i〉_v delimited by two occurrences of the symbol # (see Lemma 4).
Next, we show that we can assume that in a smallest grammar, there are nonterminals that have exactly our codewords as derivatives (see Lemma 5).
The next result (Lemma 6) states that we can also assume that in a smallest grammar there are nonterminals with derivative #〈7i + C_v(i)〉_v and nonterminals with derivative 〈7i + C_v(i)〉_v#.
Finally, we are able to fix the structure of a smallest grammar (Lemma 7) and we can show that, just like in the reduction from [33, 34] (see Page 16), the set of rules that derive factors of the form #〈7i + C_v(i)〉_v# can be transformed into a vertex cover (see Lemma 8).

The following simple, but crucial observation shall be helpful throughout the proof of correctness:

Observation 2

The word $w_{\mathcal {G}}$ contains each of the symbols $₁,…, $₆ exactly once, which implies that any smallest grammar for $w_{\mathcal {G}}$ has an axiom of the form ${\prod }^{6}_{i=1}(\beta _{i} \$_{i}) \beta _{7}$, $\beta _{i} \in ((N\cup {\Sigma }) \setminus \{\$_{1},\dots ,\$_{6}\})^{+}$, 1 ≤ i ≤ 7, where N denotes the set of nonterminals of the grammar.

We now prove the two propositions that establish the property with respect to the repetitions of factors containing ⋆.

Proposition 1

For every i, 1 ≤ i ≤ 14n, and j, 1 ≤ j ≤ 7, the word $w_{\mathcal {G}}$ contains at most one occurrence of a factor of the form

$$ \begin{array}{@{}rcl@{}} \star f(i)^{R} d_{j}, \qquad d_{j} f(i) \star, \qquad \star g(i)^{R} x_{j}, \qquad x_{j} g(i) \star . \end{array} $$

Furthermore, if such a factor occurs in $w_{\mathcal {G}}$, then the occurrence is in u.

Proof

We first note that factors of the form stated in the proposition can only occur in factors of the form $\langle i \rangle _{v} \langle i^{\prime } \rangle _{\diamond }$ or $\langle i \rangle _{\diamond } \langle i^{\prime } \rangle _{v}$. Since such factors only occur in u, the second statement of the proposition holds.

We first take care of factors of the form $\langle i \rangle _{v} d_{j^{\prime }}$, 1 ≤ i ≤ 14n, $1 \leq j^{\prime } \leq 7$. These factors are subwords of 〈M(x + j,14n)〉_v〈x + 1〉_◇ for some $j\in \{0,\dots ,6\}$ and x such that i = M(x + j,14n), which for each choice of pair (j, x) occur at most once in u. For every i, 6 < i ≤ 14n, this gives the seven choices (j, i − j) with 0 ≤ j ≤ 6; note that i = M(x + j,14n) implies x = i − j. This shows that the word u contains the subword 〈i〉_vg(x + 1)[1] = 〈i〉_vg(i − j + 1)[1] once for each j, 0 ≤ j ≤ 6, and these are the only occurrences of a subword of the form $\langle i \rangle _{v} d_{j^{\prime }}$ for some $j^{\prime }\in \{1,\dots ,7\}$ in u. Since $\{g(i-j+1)[1] \mid 0 \leq j \leq 6\}=\{d_{1},\dots ,d_{7}\}$ by Observation 1, it follows that no subword of the form $\langle i \rangle _{v}d_{j^{\prime }}$ with $j^{\prime }\in \{1,\dots ,7\}$ appears in u more than once. For every i, 1 ≤ i ≤ 6, the choices of pairs (j, x) shift x by taking the modulo and are (j, i − j) for 0 ≤ j < i and (j,14n − j + i) for i ≤ j ≤ 6. The word u hence contains the subword 〈i〉_vg(i − j + 1)[1] once for each j, 0 ≤ j < i, the subword 〈i〉_vg(14n − j + i + 1)[1] once for each j, i ≤ j ≤ 6, and these are the only occurrences of a subword of the form $\langle i \rangle _{v} d_{j^{\prime }}$ for some $j^{\prime }\in \{1,\dots ,7\}$ in u. By reducing the 14n modulo 7 to zero, shifting by + 7 and substituting j by 7 − r we get that {g(14n − j + i + 1)[1]∣i ≤ j ≤ 6} = {g(i + 1 + r)[1]∣1 ≤ r ≤ 7 − i} and {g(i − j + 1)[1]∣0 ≤ j < i} = {g(i + 1 + r)[1]∣7 − i < r ≤ 7}. By Observation 1 we can hence conclude that each subword of the form $\langle i \rangle _{v}d_{j^{\prime }}$ with $j^{\prime }\in \{1,\dots ,7\}$ appears in u at most once. Note that for i = 6, the factor 〈i〉_vg(14n − j + i + 1)[1] for the only choice j = 6 does not show up, as in this case u ends and 〈6〉_v is followed by $₁.
Consequently, for every i, 1 ≤ i ≤ 14n, every factor ⋆ f(i)^Rd_j, 1 ≤ j ≤ 7, has at most one occurrence in u.

Analogously, we can show that, for every i, 1 ≤ i ≤ 14n, every factor d_jf(i) ⋆, 1 ≤ j ≤ 7, has at most one occurrence in u. More precisely, it is sufficient to observe that, for every 6 < i ≤ 14n, the word u contains the subword g(i − j)[1]〈i〉_v once for each j, 0 ≤ j ≤ 6; for every 1 ≤ i ≤ 6, the subword g(i − j)[1]〈i〉_v once for each j, 0 ≤ j ≤ i − 1, and the subword g(14n − j)[1]〈i〉_v once for each j, 0 ≤ j ≤ 6 − i. As before, these are the only occurrences of a subword of the form $d_{j^{\prime }} \langle i \rangle _{v}$ for some $j^{\prime }\in \{1,\dots ,7\}$ in u.

For every i, 1 ≤ i ≤ 14n, there are exactly 7 factors of the form ⋆ g(i)^Rx_j, for some j, 1 ≤ j ≤ 7. Let $\star g(i)^{R} x_{j_{\ell }}$, 1 ≤ ℓ ≤ 7, be these 7 factors. By the structure of u, we observe that $\{x_{j_{\ell }} \mid 1 \leq \ell \leq 7\} = \{x_{1}, x_{2}, \dots , x_{7}\}$, which directly implies that, for every i, 1 ≤ i ≤ 14n, every factor $\star g(i)^{R} x_{j_{\ell }}$, 1 ≤ ℓ ≤ 7, has at most one occurrence in u. Analogously, we can show that, for every i, 1 < i ≤ 14n, every factor of the form x_jg(i) ⋆, 1 ≤ j ≤ 7, has at most one occurrence in u. Finally, there are exactly 6 factors of the form x_jg(1) ⋆, 1 ≤ j ≤ 7, namely the factors f(14n)[1]g(1) ⋆ and f(j)[1]g(1) ⋆, 1 ≤ j ≤ 5. Since {f(14n)[1], f(j)[1]∣1 ≤ j ≤ 5} = {x₇, x₁, x₂,…, x₅}, it follows that every factor of the form x_jg(1) ⋆, 1 ≤ j ≤ 7, has at most one occurrence in u. □
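For small n, the statement of Proposition 1 can be confirmed by brute force on the word u, which is where all such factors occur. The following sketch assumes the string representation of symbols used for illustration (`"*"` for ⋆, `"$1"` for $₁) and counts, for every i and j, the occurrences of the four factor types; `max_repeats` is a hypothetical name.

```python
def M(q, k):
    r = q % k
    return r if r else k

def digits(q):
    out = []
    while q > 0:
        a = M(q, 7)
        out.append(a)
        q = (q - a) // 7
    return out

def f(q):
    return [f"x{a}" for a in digits(q)]

def g(q):
    return [f"d{a}" for a in digits(q)]

def build_u(n):
    """The word u as a list of symbols."""
    u = []
    for j in range(7):
        for i in range(1, 14 * n + 1):
            k = M(i + j, 14 * n)
            u += g(i) + ["*"] + g(i)[::-1] + f(k) + ["*"] + f(k)[::-1]
    return u + ["$1"]

def occurrences(word, pat):
    """Number of (possibly overlapping) occurrences of pat in word."""
    k = len(pat)
    return sum(1 for s in range(len(word) - k + 1) if word[s:s + k] == pat)

def max_repeats(n):
    """Largest occurrence count in u over all factors named in Proposition 1."""
    u = build_u(n)
    best = 0
    for i in range(1, 14 * n + 1):
        for j in range(1, 8):
            pats = [
                ["*"] + f(i)[::-1] + [f"d{j}"],
                [f"d{j}"] + f(i) + ["*"],
                ["*"] + g(i)[::-1] + [f"x{j}"],
                [f"x{j}"] + g(i) + ["*"],
            ]
            best = max(best, max(occurrences(u, p) for p in pats))
    return best
```

In agreement with the proposition, `max_repeats(n)` should not exceed 1; such a check is of course no substitute for the proof, it merely guards against misreading the definition of u.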

Proposition 2

For every i, 1 ≤ i ≤ 14n, and j, 1 ≤ j ≤ 7, the word $w_{\mathcal {G}}$ contains at most one occurrence of a factor of the form

$$ \begin{array}{@{}rcl@{}} &&\star g(i)^{R} y,\qquad y g(i) \star, \qquad \star f(i)^{R} z, \qquad z f(i) \star,\\ &&d_{j} \# f(i) \star, \qquad \star f(i)^{R} \# d_{j}, \qquad \star f(i)^{R} \# x_{j}, \qquad x_{j} \# f(i) \star , \end{array} $$

where y ∈Σ∖{d₁,…, d₇} and z ∈Σ∖{x₁,…, x₇, #}.

Proof

We first consider the factors ⋆ g(i)^Ry with y ∈Σ∖{d₁,…, d₇}. In the case y ∈{x₁,…, x₇}, Proposition 1 shows that such factors have at most one occurrence in $w_{\mathcal {G}}$. For y ∈{⋆, #,¢₁,¢₂, $₁,…, $₆}, there are occurrences of factors of the form ⋆ g(i)^Ry in v and in w, but not in u. We note that any two occurrences of factors ⋆ g(i)^Ry and $\star g(i^{\prime })^{R} y^{\prime }$ in w satisfy $i \neq i^{\prime }$ and are therefore different. Moreover, all factors ⋆ g(i)^Ry in w satisfy g(i)[1] ∈{d₁, d₂, d₃, d₄} (this is due to the colouring C_e). We next observe that all factors ⋆ g(i)^Ry in v satisfy $i \in \{7i^{\prime }, 7i^{\prime }-1, 7i^{\prime }-2 \mid i^{\prime } \in \mathbb {N}\}$, which implies that for these factors, we have g(i)[1] ∈{d₅, d₆, d₇}; thus, they all differ from the factors ⋆ g(i)^Ry in w. Consequently, if a factor of the form ⋆ g(i)^Ry repeats, then there must be individual occurrences of factors 〈i〉_◇y and $\langle i \rangle _{\diamond } y^{\prime }$ in v. This is only the case for $i = 7i^{\prime } - 1$, but then there are exactly two such factors, with y ∈{#, $₂} and $y^{\prime } = {\cent}_{2}$, or for $i = 7i^{\prime } - 2$, but then there are exactly two such factors, with y ∈{#, $₃} and $y^{\prime } = {\cent}_{1}$. In both cases, the two factors are followed by different symbols. This shows that each factor ⋆ g(i)^Ry with y ∈Σ∖{d₁,…, d₇} has at most one occurrence in $w_{\mathcal {G}}$. For the factors yg(i) ⋆ the argument is the same up to the point where we consider individual occurrences of factors y〈i〉_◇ and $y^{\prime } \langle i \rangle _{\diamond }$ in v. Again, this is only possible for $i = 7i^{\prime } - 1$ or $i = 7i^{\prime } - 2$, but in the first case, we have y = ¢₁, $y^{\prime } = \#$, while in the second case, we have y = ¢₂, $y^{\prime } = \#$.

We next turn to the factors ⋆ f(i)^Rz with z ∈Σ∖{x₁,…, x₇, #}. Again, Proposition 1 shows that for z ∈{d₁,…, d₇} such factors have at most one occurrence in $w_{\mathcal {G}}$; thus, we consider the case z ∈{⋆,¢₁,¢₂, $₁,…, $₆}. We first note that such factors have no occurrence in u. Moreover, for every i, 1 ≤ i ≤ 14n, any factor of the form 〈i〉_vy with y∉{d₁,…, d₇, x₁,…, x₇} has either no occurrence in vw, or exactly 5 occurrences in v and at most 3 occurrences in w (this is due to the fact that $\mathcal {G}$ is subcubic). However, y is equal to # for all but two of those occurrences, where one occurrence is with y = ¢₁ and the other with y = ¢₂. Consequently, each factor ⋆ f(i)^Rz with z ∈Σ∖{x₁,…, x₇, #} has at most one occurrence in $w_{\mathcal {G}}$. The argument for the factors zf(i) ⋆ with z ∈Σ∖{x₁,…, x₇, #} is analogous, with the difference that the only two occurrences of a factor y〈i〉_v in v with y∉{d₁,…, d₇, x₁,…, x₇, #} are once with y ∈{$₃,¢₁} and once with y ∈{$₄,¢₂}.

We next consider the factors d_j#f(i) ⋆ and first note that such a factor only occurs in a factor #〈i〉_v that is preceded by a factor $\langle i^{\prime } \rangle _{\diamond }$, for some $i^{\prime }$, $1 \leq i^{\prime } \leq 14n$, and that such factors only occur in v or w. In v, there are either no or exactly 3 occurrences of #〈i〉_v. The first one is either a prefix of v or preceded by 〈7ℓ − 1〉_◇, 1 ≤ ℓ ≤ n, the second is preceded by either $₂ or 〈7ℓ − 2〉_◇, 1 ≤ ℓ ≤ n, and the third one is preceded by either $₅ or 〈7ℓ〉_◇, 1 ≤ ℓ ≤ n. Hence, these three occurrences are preceded by symbols d₆, d₅ and d₇, respectively (or by symbols not in {d₁,…, d₇}). Consequently, the factor d_j#f(i) ⋆ is not repeated in v and if it occurs, j ∈{5,6,7} holds. Next, we note that every #〈i〉_v in w that is preceded by a $\langle i^{\prime } \rangle _{\diamond }$, satisfies $i^{\prime } = 7\ell + C_{e}(v_{j_{2\ell }}, v_{j_{2\ell + 1}})$, and since the range of C_e is {1,2,3,4}, this occurrence of #〈i〉_v is preceded by symbol d₁, d₂, d₃ or d₄. Finally, we have to show that no d_j#〈i〉_v is repeated in w. To this end, we assume that d_j#〈i〉_v with j ∈{1,2,3,4} is repeated. This implies that there are $k, k^{\prime }$, $1 \leq k < k^{\prime } \leq m-1$, with $j_{2k-1} = j_{2k^{\prime }-1} = i$, and, furthermore, $\langle 7(k - 1) + C_{e}(v_{j_{2(k-1)}}, v_{j_{2(k - 1)+1}}) \rangle _{\diamond }$ and $\langle 7(k^{\prime } - 1) + C_{e}(v_{j_{2(k^{\prime }-1)}}, v_{j_{2(k^{\prime } - 1)+1}}) \rangle _{\diamond }$ both end with symbol d_j. Thus, $C_{e}(v_{j_{2(k-1)}}, v_{j_{2(k - 1)+1}}) = C_{e}(v_{j_{2(k^{\prime }-1)}}, v_{j_{2(k^{\prime } - 1)+1}}) = j$, which is a contradiction, since the edges $(v_{j_{2(k-1)}}, v_{j_{2(k - 1)+1}})$ and $(v_{j_{2(k^{\prime }-1)}}, v_{j_{2(k^{\prime } - 1)+1}})$ of $\mathcal {G}^{\prime }$ are incident with the same vertex $v_{j_{2k-1}} = v_{j_{2k^{\prime }-1}} = v_{i}$ and C_e is a proper edge colouring for $\mathcal {G}^{\prime }$. 
Consequently, no d_j#〈i〉_v is repeated in w; thus, the word $w_{\mathcal {G}}$ contains at most one occurrence of a factor of the form d_j#f(i) ⋆.

In an analogous way, we can show that every factor of the form ⋆ f(i)^R#d_j in v satisfies j ∈{5,6,7} and in w it satisfies j ∈{1,2,3,4}. That these factors do not repeat follows from the fact that ⋆ f(i)^R# occurs at most 3 times in v (followed by the different symbols d₅, d₆ and d₇) and the repetitions of ⋆ f(i)^R# in w are followed by distinct symbols from {d₁, d₂, d₃, d₄} due to the proper edge colouring C_e of $\mathcal {G}^{\prime }$. Thus, the word $w_{\mathcal {G}}$ contains at most one occurrence of a factor of the form ⋆ f(i)^R#d_j.

For any i, 1 ≤ i ≤ 14n, and j, 1 ≤ j ≤ 7, the factor ⋆ f(i)^R#x_j only occurs in w and only in a factor of the form $\langle 7\ell +C_{v}(\ell ) \rangle _{v} \# \langle 7\ell ^{\prime }+C_{v}(\ell ^{\prime }) \rangle _{v}$, $1 \leq \ell , \ell ^{\prime } \leq n$, with i = 7ℓ + C_v(ℓ) and $f(7\ell ^{\prime }+C_{v}(\ell ^{\prime }))[1] = x_{j}$. Hence, if ⋆ f(i)^R#x_j has two occurrences, then there are $\ell ^{\prime }, \ell ^{\prime \prime }$, $1 \leq \ell ^{\prime }, \ell ^{\prime \prime } \leq n$, $\ell ^{\prime } \neq \ell ^{\prime \prime }$ (the two occurrences stem from two different edges of $\mathcal {G}$), such that the vertices $v_{\ell ^{\prime }}$ and $v_{\ell ^{\prime \prime }}$ are neighbours of v_ℓ (in $\mathcal {G}$), and $f(7\ell ^{\prime }+C_{v}(\ell ^{\prime }))[1] = f(7\ell ^{\prime \prime }+C_{v}(\ell ^{\prime \prime }))[1] = x_{j}$, which implies $C_{v}(\ell ^{\prime }) = C_{v}(\ell ^{\prime \prime }) = j$. Since $v_{\ell ^{\prime }}$ and $v_{\ell ^{\prime \prime }}$ have the common neighbour v_ℓ, they are adjacent in $\mathcal {G}^{2}$; this contradicts the fact that C_v is a proper vertex colouring for the graph $\mathcal {G}^{2}$. In an analogous way, it follows that the factor x_j#f(i) ⋆ is not repeated. □

Since a smallest grammar does not contain rules which produce a factor which is not repeated, Propositions 1 and 2 yield the following:

Lemma 4

For every smallest grammar G = (N,Σ, R, ax) for $w_{\mathcal {G}}$, $|\mathfrak {D}(A)|_{\star } \geq 1$ for some A ∈ N implies that $\mathfrak {D}(A)$ is a factor of some #〈7i + C_v(i)〉_v#, 1 ≤ i ≤ n, or a factor of some 〈j〉_v, 1 ≤ j ≤ 14n, or a factor of some 〈j〉_◇, 1 ≤ j ≤ 14n.

The main consequence of Lemma 4 is that, in a smallest grammar, the axiom has a length of at least the number of occurrences of ⋆ in $w_{\mathcal {G}}$. This allows us to show that, without increasing the size of the grammar, the axiom can be restructured, such that each individual codeword is produced by its own nonterminal.

Lemma 5

There is a smallest grammar G for $w_{\mathcal {G}}$ such that, for every i, 1 ≤ i ≤ 14n, there is a nonterminal with derivative 〈i〉_◇ and a nonterminal with derivative 〈i〉_v.

Proof

Let G = (N,Σ, R, ax) be a smallest grammar with $\mathfrak {D}(G) = w_{\mathcal {G}}$. We shall first show how G can be modified in such a way that, for every i, 1 ≤ i ≤ 14n, there is a nonterminal with derivative 〈i〉_◇. To this end, we assume that for some $\mathfrak {I}_{\diamond } \subseteq \{1, 2, \ldots , 14n\}$ and every i, 1 ≤ i ≤ 14n, there currently is a nonterminal in G with derivative 〈i〉_◇ if and only if $i \in \mathfrak {I}_{\diamond }$; furthermore, let $\overline {\mathfrak {I}_{\diamond }} = \{1, 2, \ldots , 14n\} \setminus \mathfrak {I}_{\diamond }$. For the sake of concreteness, for every $i \in \mathfrak {I}_{\diamond }$, let $\widehat {D}_{i}$ be the nonterminal with $\mathfrak {D}(\widehat {D}_{i}) = \langle i \rangle _{\diamond }$.

We now recursively define a set of rules R_◇ := {r_{◇, i}∣1 ≤ i ≤ 14n} for nonterminals D_i, 1 ≤ i ≤ 14n, by $r_{\diamond , i} := D_{i} \rightarrow d_{i} \star d_{i}$, 1 ≤ i ≤ 7, and $r_{\diamond , i} := D_{i} \rightarrow g(i)[1] D_{h(i)} g(i)[1]$, 8 ≤ i ≤ 14n, where $h(i):= \frac {i-M(i,7)}{7}$. Obviously, $\mathfrak {D}(D_{i}) = \langle i \rangle _{\diamond }$, 1 ≤ i ≤ 14n. We modify G by the following algorithm. For every i = 1,2,…,14n, if $i \in \overline {\mathfrak {I}_{\diamond }}$, then we add the rule r_{◇, i} for D_i from R_◇ to G, and if $i \in \mathfrak {I}_{\diamond }$, then we replace the rule $\widehat {D}_{i} \to \alpha $ by D_i → α. Furthermore, we can carry out an analogous modification with respect to derivatives 〈i〉_v. More precisely, we define $\mathfrak {I}_{v} \subseteq \{1, 2, \ldots , 14n\}$ to be such that, for exactly the $i \in \mathfrak {I}_{v}$, there is a nonterminal with derivative 〈i〉_v. Then, in the same way as above, we can add rules from the set R_v := {r_{v, i}∣1 ≤ i ≤ 14n}, where $r_{v, i} := V_{i} \rightarrow x_{i} \star x_{i}$, 1 ≤ i ≤ 7, and $r_{v, i} := V_{i} \rightarrow f(i)[1] V_{h(i)} f(i)[1]$, 8 ≤ i ≤ 14n, with h(i) as defined above.
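The recursive definition of the rules in R_◇ can be made concrete with a short sketch, which generates the rules for D_1, …, D_limit and expands them to check that D_i indeed derives 〈i〉_◇. Symbols are assumed to be strings (`"d3"` is a terminal, `"D3"` a nonterminal, `"*"` stands for ⋆); `separator_rules` and `derive` are hypothetical names.

```python
def M(q, k):
    r = q % k
    return r if r else k

def digits(q):
    out = []
    while q > 0:
        a = M(q, 7)
        out.append(a)
        q = (q - a) // 7
    return out

def h(i):
    """h(i) = (i - M(i, 7)) / 7, the index obtained by dropping the first digit."""
    return (i - M(i, 7)) // 7

def separator_rules(limit):
    """Rules D_i -> g(i)[1] D_h(i) g(i)[1] (or d_i * d_i for i <= 7), each of size 3."""
    rules = {}
    for i in range(1, limit + 1):
        d = f"d{M(i, 7)}"                      # g(i)[1], the least significant digit
        middle = "*" if i <= 7 else f"D{h(i)}"
        rules[f"D{i}"] = [d, middle, d]
    return rules

def derive(sym, rules):
    """Fully expand a symbol into the terminal word it derives."""
    if sym not in rules:
        return [sym]
    return [t for s in rules[sym] for t in derive(s, rules)]

def separator_codeword(i):
    """<i>_diamond = g(i) * g(i)^R as a symbol list."""
    half = [f"d{a}" for a in digits(i)]
    return half + ["*"] + half[::-1]
```

For example, `separator_rules(14)["D8"]` is `["d1", "D1", "d1"]`, and expanding it yields d₁d₁⋆d₁d₁ = 〈8〉_◇. The rules in R_v are obtained in the same way with x in place of d.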

We denote this modified grammar by $G^{\prime }$ and note that, by the considerations from above, for every i, 1 ≤ i ≤ 14n, $G^{\prime }$ contains nonterminals D_i and V_i with

$$ \mathfrak{D}(D_{i}) = \langle i \rangle_{\diamond} \text{ and } \mathfrak{D}(V_{i}) = \langle i \rangle_{v}, 1 \leq i \leq 14n . $$

Moreover, since every rule from R_◇ and R_v has size 3, $|G^{\prime }| = |G| + 3(|\overline {\mathfrak {I}_{\diamond }}| + |\overline {\mathfrak {I}_{v}}|)$. In the remainder of this proof, we show that this size increase can be compensated by using the new rules in order to significantly shorten the axiom. Hence, we obtain a smallest grammar, with the properties claimed in the lemma. To this end, we first measure the size of the axiom of the original grammar G.

Claim 1: $\mathsf {ax} = {\prod }^{6}_{i=1}(\beta _{i} \$_{i}) \beta _{7}$, where $\beta _{i}\in ((N\cup {\Sigma }) \setminus \{\$_{1},\dots ,\$_{6}\})^{+}$, 1 ≤ i ≤ 7, and β₁ contains at least 196n occurrences of symbols (terminal or nonterminal) that each produces exactly one occurrence of ⋆.

Proof of Claim 1: From Observation 2, it follows that $\mathsf {ax} = {\prod }^{6}_{i=1}(\beta _{i} \$_{i}) \beta _{7}$, $\beta _{i}\in ((N\cup {\Sigma }) \setminus \{\$_{1},\dots ,\$_{6}\})^{+}$, 1 ≤ i ≤ 7. Furthermore, β₁ contains at least |u|_⋆ symbols (terminal or nonterminal), since otherwise at least two occurrences of ⋆ of u are produced by the same nonterminal, which is a contradiction to Lemma 4. Hence, β₁ contains at least 196n occurrences of symbols that each produces exactly one occurrence of ⋆. (Claim 1) $\square $

Claim 2: There are at least $7\lceil \frac {|\overline {\mathfrak {I}_{\diamond }}| + |\overline {\mathfrak {I}_{v}}|}{2}\rceil $ occurrences of symbols in β₁ (terminal or nonterminal), each of which has a derivative without any occurrence of ⋆.

Proof of Claim 2: Let $i \in \overline {\mathfrak {I}_{\diamond }}$, i. e., there is no nonterminal with derivative 〈i〉_◇. Furthermore, a derivative that properly contains 〈i〉_◇ (and the corresponding nonterminal which occurs in β₁) contains an occurrence of ⋆ and occurrences of symbols from both sets {d₁,…, d₇} and {x₁,…, x₇}, which contradicts Lemma 4. Consequently, each of the 7 occurrences of 〈i〉_◇ is produced by at least two symbols. Hence, for each of these 7 occurrences, there is one symbol producing a factor of 〈i〉_◇ containing the symbol ⋆ and a second symbol, which produces a factor of 〈i〉_◇ that contains symbols from {d₁,…, d₇}. Due to Lemma 4, this second symbol cannot also produce the next or preceding occurrence of ⋆. This means that for each $i \in \overline {\mathfrak {I}_{\diamond }}$, there exist 7 symbols that do not produce a symbol ⋆. In the same way, we can also conclude that for each $i \in \overline {\mathfrak {I}_{v}}$, there exist 7 symbols that do not produce a symbol ⋆. However, it is possible that these symbols in β₁ which do not produce a ⋆ coincide, i. e., such a symbol can produce parts of some 〈i〉_◇ with $i \in \overline {\mathfrak {I}_{\diamond }}$ and $\langle i^{\prime } \rangle _{v}$, with $i^{\prime } \in \overline {\mathfrak {I}_{v}}$. So we can only conclude that there are at least $7\lceil \frac {|\overline {\mathfrak {I}_{\diamond }}| + |\overline {\mathfrak {I}_{v}}|}{2}\rceil $ occurrences of symbols in β₁ that do not produce an occurrence of ⋆. (Claim 2) $\square $

From these two claims, it follows that the axiom of G (and therefore the whole grammar G) has size of at least $196n + 7\lceil \frac {|\overline {\mathfrak {I}_{\diamond }}| + |\overline {\mathfrak {I}_{v}}|}{2}\rceil $. We now change $G^{\prime }$ a second time (into $G^{\prime \prime }$), as follows.
We replace β₁ in the axiom $\mathsf {ax}^{\prime } = {\prod }^{6}_{i=1}(\beta _{i} \$_{i}) \beta _{7}$ of $G^{\prime }$ (note that Observation 2 implies that $\mathsf {ax}^{\prime }$ must have this structure) by $\beta ^{\prime }_{1} = {\prod }^{6}_{j=0} {\prod }^{14n}_{i=1} D_{i} V_{M(i+j,14n)}$. We note that $|\beta _{1}| \geq 196n + 7\lceil \frac {|\overline {\mathfrak {I}_{\diamond }}| + |\overline {\mathfrak {I}_{v}}|}{2}\rceil $, whereas $|\beta ^{\prime }_{1}| = 196n$. Consequently,

$$ \begin{array}{@{}rcl@{}} |G^{\prime\prime}| &=& \underbrace{|G| + 3(|\overline{\mathfrak{I}_{\diamond}}| + |\overline{\mathfrak{I}_{v}}|)}_{|G^{\prime}|} + |\beta^{\prime}_{1}| - |\beta_{1}|\\ &\leq& |G| + 3(|\overline{\mathfrak{I}_{\diamond}}| + |\overline{\mathfrak{I}_{v}}|) + 196n - \left( 196n + 7\left\lceil \frac{|\overline{\mathfrak{I}_{\diamond}}| + |\overline{\mathfrak{I}_{v}}|}{2}\right\rceil\right)\\ &=& |G| + 3(|\overline{\mathfrak{I}_{\diamond}}| + |\overline{\mathfrak{I}_{v}}|) - 7\left\lceil \frac{|\overline{\mathfrak{I}_{\diamond}}| + |\overline{\mathfrak{I}_{v}}|}{2}\right\rceil\\ &\leq& |G| . \end{array} $$

□
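The last inequality in the proof of Lemma 5 only uses that a ceiling is at least its argument. Writing $x = |\overline{\mathfrak{I}_{\diamond}}| + |\overline{\mathfrak{I}_{v}}| \geq 0$ (the abbreviation x is introduced here only for illustration), it can be spelled out as follows.

$$ 3x - 7\left\lceil \frac{x}{2} \right\rceil \leq 3x - \frac{7x}{2} = -\frac{x}{2} \leq 0 . $$

Hence the 3x symbols spent on the new rules are always compensated by the shortened axiom.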

In the hardness proof from [33, 34] for the case of unbounded alphabets (see Page 16), one simple, but crucial fact was that for every i, 1 ≤ i ≤ n, we can assume that nonterminals for each factor #v_i and v_i# exist. By using the previously mentioned lemmas, we now show a similar statement for our reduction:

Lemma 6

There is a smallest grammar G for $w_{\mathcal {G}}$ such that, for every i, 1 ≤ i ≤ n, there is a nonterminal with derivative #〈7i + C_v(i)〉_v and a nonterminal with derivative 〈7i + C_v(i)〉_v#.

Proof

Let G = (N,Σ, R, ax) be a smallest grammar for $w_{\mathcal {G}}$. By Lemma 5, we can assume that, for every i, 1 ≤ i ≤ 14n, there is a nonterminal D_i with derivative 〈i〉_◇ and a nonterminal V_i with derivative 〈i〉_v.

Let ℓ be the total number of occurrences of symbols from {⋆, ¢₁, ¢₂, #, $₁,…, $₆} in $w_{\mathcal {G}}$. We can conclude that |ax|≤ ℓ, since an axiom of length ℓ can be obtained from $w_{\mathcal {G}}$ (without introducing any new rules) by replacing all occurrences of 〈i〉_◇ and 〈i〉_v by D_i and V_i, respectively.

Let $N_{\mathsf {ax}} = \{A \mid A \in N, |\mathsf {ax}|_{A} \geq 1, |\mathfrak {D}(A)|_{\star } \geq 1\}$ and let Γ = {⋆,¢₁,¢₂, #}. Furthermore, for every i, 1 ≤ i ≤ 3, let $N_{\mathsf {ax}, i} = \{A \mid A \in N_{\mathsf {ax}}, {\sum }_{x \in {\Gamma }} |\mathfrak {D}(A)|_{x} = i\}$. Since, for every A ∈ N_ax, ${\sum }_{x \in {\Gamma }} |\mathfrak {D}(A)|_{x} > 3$ is a contradiction to Lemma 4, we can conclude that {N_ax,1, N_ax,2, N_ax,3} is a partition of N_ax. Consequently, we can use this partition in order to estimate the length of the axiom in the following way: $|\mathsf {ax}| \geq \ell - {\sum }_{A \in N_{\mathsf {ax}, 2}} |\mathsf {ax}|_{A} - 2 {\sum }_{A \in N_{\mathsf {ax}, 3}} |\mathsf {ax}|_{A}$ (note that each occurrence of some A ∈ N_{ax, j}, j ∈{2,3}, is responsible for |ax|_A units of the size |ax|, but also for exactly j|ax|_A occurrences of the total amount ℓ of symbols from $\{\star , {\cent}_{1}, {\cent}_{2}, \#, \$_{1},\dots ,\$_{6}\}$). Moreover, also due to Lemma 4, for every A ∈ N_ax,2, $\mathfrak {D}(A) = \# f(7i+C_{v}(i)) \star r_{i}$ or $\mathfrak {D}(A) = r_{i} \star f(7i+C_{v}(i))^{R}\#$ with |r_i|≤|f(7i + C_v(i))| and, for every A ∈ N_ax,3, $\mathfrak {D}(A) = \#\langle 7i+C_{v}(i) \rangle _{v} \#$.

We now add to G, for every i, 1 ≤ i ≤ n, the rules $\overset {{~}_{\leftarrow }}{V_{i}}\rightarrow \# V_{7i+C_{v}(i)}$ and ${\!}_{\rightarrow }{V_{i}}\rightarrow V_{7i+C_{v}(i)} \#$, and, for every A ∈ N_ax,3, we add the rule $\overset {{~}_{\leftrightarrow }}{V_{i}}\rightarrow \overset {{~}_{\leftarrow }}{V_{i}} \#$, where $\mathfrak {D}(A) = \#\langle 7i+C_{v}(i) \rangle _{v} \#$. Then, we replace ax by a new axiom $\mathsf {ax}^{\prime }$ that is obtained from $w_{\mathcal {G}}$ in the following way. Every factor 〈i〉_◇ is replaced by D_i. For every occurrence of ⋆ in $w_{\mathcal {G}}$, if this occurrence of ⋆ is produced (according to ax) by a nonterminal A ∈ N_ax,3, which, since $\mathfrak {D}(A) = \#\langle 7i+C_{v}(i) \rangle _{v} \#$, implies that it is inside a factor #〈7i + C_v(i)〉_v#, then we replace #〈7i + C_v(i)〉_v# by $\overset {{~}_{\leftrightarrow }}{V_{i}}$. All remaining factors of the form #〈7i + C_v(i)〉_v# are replaced by $\overset {{~}_{\leftarrow }}{V_{i}} \#$. Then, all remaining factors #〈7i + C_v(i)〉_v and 〈7i + C_v(i)〉_v# are replaced by $\overset {{~}_{\leftarrow }}{V_{i}}$ and ${\!}_{\rightarrow }{V_{i}}$, respectively (note that since there are no factors of the form #〈7i + C_v(i)〉_v# left, this is unambiguous). We note that $|\mathsf {ax}^{\prime }| = \ell - {\sum }_{i=1}^{n}(|\mathsf {ax}^{\prime }|_{\overset {{~}_{\leftarrow }}{V_{i}}}+|\mathsf {ax}^{\prime }|_{{\!}_{\rightarrow }{V_{i}}}) - 2 {\sum }_{i=1}^{n} |\mathsf {ax}^{\prime }|_{\overset {{~}_{\leftrightarrow }}{V_{i}}}$.

Next, we show that all the rules for the nonterminals of N_ax,2 ∪ N_ax,3 can be removed from the grammar. To this end, let A ∈ N_ax,2 ∪ N_ax,3, which means that $|\mathfrak {D}(A)|_{\#} \geq 1$. However, every occurrence of # of $w_{\mathcal {G}}$ that is produced by a rule (and is not already present in the new axiom $\mathsf {ax}^{\prime }$), is directly produced by $\overset {{~}_{\leftarrow }}{V_{i}}$, ${\!}_{\rightarrow }{V_{i}}$ or $\overset {{~}_{\leftrightarrow }}{V_{i}}$, i. e., it occurs on the right side of these rules and is not produced by means of any other nonterminal. Consequently, in the derivation of $w_{\mathcal {G}}$, the nonterminal A is not used and, therefore, its rule can be erased.

It only remains to show that the modified grammar is not larger than the original one, i. e., we have to compare $|\mathsf {ax}^{\prime }|$ to |ax| and show that the size increase of 2 caused by each added rule is compensated. For every new rule $\overset {{~}_{\leftrightarrow }}{V_{i}}\rightarrow \overset {{~}_{\leftarrow }}{V_{i}} \#$ (of cost 2), there is an A ∈ N_ax,3 with $\mathfrak {D}(A) = \# \langle 7i+C_{v}(i) \rangle _{v} \#$ (of cost at least 2), for which the rule is erased and all occurrences of A in ax correspond to occurrences of some $\overset {{~}_{\leftrightarrow }}{V_{i}}$ in $\mathsf {ax}^{\prime }$, hence ${\sum }_{i=1}^{n} |\mathsf {ax}^{\prime }|_{\overset {{~}_{\leftrightarrow }}{V_{i}}}={\sum }_{A \in N_{\mathsf {ax}, 3}} |\mathsf {ax}|_{A}$. For every new rule $\overset {{~}_{\leftarrow }}{V_{i}}\rightarrow \# V_{7i+C_{v}(i)}$ consider $\overset {{~}_{\leftarrow }}{I}:=\{i\colon \mathfrak {D}(A) = \# f(7i+C_{v}(i)) \star r_{i} \text { for some } A \in N_{\mathsf {ax}, 2}\}$. If $i\in \overset {{~}_{\leftarrow }}{I}$, we have removed at least one rule A → α with $ \mathfrak {D}(A) = \# f(7i+C_{v}(i)) \star r_{i}$ and |α|≥ 2, so the cost for all rules $\overset {{~}_{\leftarrow }}{V_{i}}\rightarrow \# V_{7i+C_{v}(i)}$ with $i\in \overset {{~}_{\leftarrow }}{I}$ is compensated. Further, every occurrence of this A in ax yields an occurrence of $\overset {{~}_{\leftarrow }}{V_{i}}$ in $\mathsf {ax}^{\prime }$. If $i\not \in \overset {{~}_{\leftarrow }}{I}$, then both occurrences of #〈7i + C_v(i)〉_v in the factor v of $w_{\mathcal {G}}$ are produced in ax by at least two nonterminals each. An analogous argument applies to the new rules ${\!}_{\rightarrow }{V_{i}}\rightarrow V_{7i+C_{v}(i)} \#$ with ${\!}_{\rightarrow }{I}:=\{i\colon \mathfrak {D}(A) = r_{i}\star f(7i+C_{v}(i))^{R} \# \text { for some } A \in N_{\mathsf {ax}, 2}\}$.
This yields ${\sum }_{i=1}^{n}(|\mathsf {ax}^{\prime }|_{\overset {{~}_{\leftarrow }}{V_{i}}}+|\mathsf {ax}^{\prime }|_{{\!}_{\rightarrow }{V_{i}}}) \geq {\sum }_{A \in N_{\mathsf {ax}, 2}} |\mathsf {ax}|_{A} +2(n-|\overset {{~}_{\leftarrow }}{I}|)+2(n-|{\!}_{\rightarrow }{I}|)$. Together with ${\sum }_{i=1}^{n} |\mathsf {ax}^{\prime }|_{\overset {{~}_{\leftrightarrow }}{V_{i}}}={\sum }_{A \in N_{\mathsf {ax}, 3}} |\mathsf {ax}|_{A}$ we can conclude:

$$ \begin{array}{@{}rcl@{}} |\mathsf{ax}^{\prime}| &=& \ell - \sum\limits_{i=1}^{n}(|\mathsf{ax}^{\prime}|_{\overset{{~}_{\leftarrow}}{V_{i}}}+|\mathsf{ax}^{\prime}|_{{\!}_{\rightarrow}{V_{i}}}) - 2 \sum\limits_{i=1}^{n} |\mathsf{ax}^{\prime}|_{\overset{{~}_{\leftrightarrow}}{V_{i}}}\\ &\leq& \ell - \sum\limits_{A \in N_{\mathsf{ax}, 2}} |\mathsf{ax}|_{A} -2(n-|\overset{{~}_{\leftarrow}}{I}|)-2(n-|{\!}_{\rightarrow}{I}|) - 2\!\!\sum\limits_{A \in N_{\mathsf{ax}, 3}} |\mathsf{ax}|_{A}\\ & \leq&|\mathsf{ax}| -2(n-|\overset{{~}_{\leftarrow}}{I}|)-2(n-|{\!}_{\rightarrow}{I}|) \end{array} $$

Since every new rule for ${\overset {{~}_{\leftarrow }}{V_{i}}}$ or ${{\!}_{\rightarrow }{V_{i}}}$ is added at a cost of two, the difference between $|\mathsf {ax}^{\prime }|$ and |ax| compensates for the additional rules $\overset {{~}_{\leftarrow }}{V_{i}}\rightarrow \#V_{7i+C_{v}(i)} $ with $i\not \in \overset {{~}_{\leftarrow }}{I}$ and ${\!}_{\rightarrow }{V_{i}}\rightarrow V_{7i+C_{v}(i)} \#$ with $i\not \in {\!}_{\rightarrow }{I}$. Recall further that the cost of the rules for ${\overset {{~}_{\leftrightarrow }}{V_{i}}}$ is compensated by deleting the rules for the nonterminals in N_ax,3. Overall, the modified grammar is not larger than the original grammar. Furthermore, the new grammar now has the form stated in the lemma. □

Now, by the lemmas presented above, we are able to sufficiently pin down the structure of a smallest grammar for $w_{\mathcal {G}}$:

Lemma 7

There is a smallest grammar G for $w_{\mathcal {G}}$ that contains all the rules

R_◇ := {r_{◇, i} ∣ 1 ≤ i ≤ 14n}, with $r_{\diamond , i} := D_{i} \rightarrow d_{i} \star d_{i}$, 1 ≤ i ≤ 7, and $r_{\diamond , i} := D_{i} \rightarrow g(i)[1] D_{h(i)} g(i)[1]$, 8 ≤ i ≤ 14n, where $h(i):= \frac {i-M(i,7)}{7}$,
R_v := {r_{v, i} ∣ 1 ≤ i ≤ 14n}, with $r_{v, i} := V_{i} \rightarrow x_{i} \star x_{i}$, 1 ≤ i ≤ 7, and $r_{v, i} := V_{i} \rightarrow f(i)[1] V_{h(i)} f(i)[1]$, 8 ≤ i ≤ 14n, where $h(i):= \frac {i-M(i,7)}{7}$,
$\overset {{~}_{\leftarrow }}{V} := \{\overset {{~}_{\leftarrow }}{V_{i}}\rightarrow \# V_{7i+C_{v}(i)} \mid 1 \leq i \leq n\}$,
${\!}_{\rightarrow }{V} := \{{\!}_{\rightarrow }{V_{i}} \rightarrow V_{7i+C_{v}(i)} \# \mid 1 \leq i \leq n\}$,
$\overset {{~}_{\leftrightarrow }}{V} := \{\overset {{~}_{\leftrightarrow }}{V_{i}} \rightarrow \# {\!}_{\rightarrow }{V_{i}} \mid i \in \mathfrak {I}\}$, for some $\mathfrak {I} \subseteq \{1, 2, \ldots , n\}$.

and an axiom $\mathsf {ax} = {\prod }^{6}_{i=1}(\beta _{i} \$_{i}) \beta _{7}$ with

$$ \begin{array}{@{}rcl@{}} &&\beta_{1} = \prod\limits_{j=0}^{6}\left( \prod\limits_{i=1}^{14n} (D_{i} V_{M(i+j,14n)}) \right) ,\qquad \beta_{2} = \prod\limits_{i=1}^{n} \left( \overset{{~}_{\leftarrow}}{V_{i}} {\cent}_{1} D_{7i-1} \right) ,\\ &&\beta_{3} = \prod\limits_{i=1}^{n} \left( \overset{{~}_{\leftarrow}}{V_{i}} {\cent}_{2} D_{7i-2}\right) , \qquad \qquad\quad \beta_{4} = \prod\limits_{i=1}^{n} \left( {\!}_{\rightarrow}{V_{i}} D_{7i-2} {\cent}_{1}\right) , \\ &&\beta_{5} = \prod\limits_{i=1}^{n} \left( {\!}_{\rightarrow}{V_{i}} D_{7i-1} {\cent}_{2}\right) , \end{array} $$

$$ \begin{array}{@{}rcl@{}} \beta_{6} &= &\prod\limits_{i=1}^{n} \left( y_{i} D_{7i} \right),\text{ where for every \textit{i}, $1 \leq i \leq n$, } y_{i}=\begin{cases} \overset{{~}_{\leftrightarrow}}{V_{i}}& \text{ if } i \in \mathfrak{I},\\ \overset{{~}_{\leftarrow}}{V_{i}} \#& \text{ otherwise}, \end{cases}\\ \beta_{7} &= &\prod\limits_{i=1}^{m-1}(y_{i} D_{7i+C_{e}(v_{j_{2i}},v_{j_{2i+1}})}) y_{m}, \text{ where for every \textit{i}, $1 \leq i \leq m$, }\\ &&y_{i} \in \{\overset{{~}_{\leftrightarrow}}{V}_{j_{2i-1}} {\!}_{\rightarrow}{V}_{j_{2i}}, \overset{{~}_{\leftarrow}}{V}_{j_{2i-1}} \overset{{~}_{\leftrightarrow}}{V}_{j_{2i}}\} \text{ if } \{j_{2i-1}, j_{2i}\} \cap \mathfrak{I} \neq \emptyset,\\ &&y_{i} = \overset{{~}_{\leftarrow}}{V}_{j_{2i-1}} \overset{{~}_{\leftarrow}}{V}_{j_{2i}}\# \text{ otherwise}. \end{array} $$

Proof

Let G be a smallest grammar for $w_{\mathcal {G}}$. By Lemma 5, we can assume that, for every i, 1 ≤ i ≤ 14n, there is a nonterminal D_i with derivative 〈i〉_◇ and a nonterminal V_i with derivative 〈i〉_v, and, by Lemma 6, we can assume that, for every i, 1 ≤ i ≤ n, there is a nonterminal $\overset {{~}_{\leftarrow }}{V_{i}}$ with derivative #〈7i + C_v(i)〉_v and a nonterminal ${\!}_{\rightarrow }{V_{i}}$ with derivative 〈7i + C_v(i)〉_v#. Obviously, for every i, 1 ≤ i ≤ n, we can substitute the rule for $\overset {{~}_{\leftarrow }}{V_{i}}$ by $\overset {{~}_{\leftarrow }}{V_{i}} \to \# V_{i}$ and the rule for ${\!}_{\rightarrow }{V_{i}}$ by ${\!}_{\rightarrow }{V_{i}} \to V_{i} \#$, without increasing the size of G.

Next, for every V_j → α_j with |α_j|≥ 3, we can replace V_j → α_j by V_j → x_j ⋆ x_j, if j ≤ 7, and by $V_{j} \rightarrow f(j)[1] V_{h(j)} f(j)[1]$, if 8 ≤ j, where $h(j):= \frac {j-M(j,7)}{7}$. This does not increase the size of G, since the size of the modified rules can only decrease and no new rules need to be added. Now let $j = \max \limits \{i \mid 1 \leq i \leq 14n, V_{i} \to \alpha _{i}, |\alpha _{i}| = 2\}$. We can now again replace V_j → α_j by V_j → x_j ⋆ x_j, if j ≤ 7, and by $V_{j} \rightarrow f(j)[1] V_{h(j)} f(j)[1]$, if 8 ≤ j, where $h(j):= \frac {j-M(j,7)}{7}$, but now this operation increases the size of the grammar by 1, which, as shall be shown next, is compensated by removing a rule from the grammar. To this end, we note that α_j = A_jB_j and $\mathfrak {D}(A_{j}) = f(j) \star t_{j}$ or $\mathfrak {D}(B_{j}) = t_{j} \star f(j)^{R}$ for some $t_{j} \in \{x_{1}, \ldots , x_{7}\}^{*}$. Let us assume that $\mathfrak {D}(A_{j}) = f(j) \star t_{j}$ (the case $\mathfrak {D}(B_{j}) = t_{j} \star f(j)^{R}$ can be handled analogously); note that this particularly implies that A_j∉{V_i∣1 ≤ i ≤ 14n}, since its derivative is not of the form 〈i〉_v. Since f(j) ⋆ t_j does not occur in any $\langle j^{\prime } \rangle _{v}$ with $j^{\prime } < j$, A_j is not involved in a production of any $\langle j^{\prime } \rangle _{v}$ with $j^{\prime } < j$. Moreover, A_j cannot occur on the right side of the rule for a $V_{j^{\prime }}$ with $j < j^{\prime }$, since, due to the maximality of j and the modifications from above, those only have nonterminals of the form V_i on the right side. Thus, A_j has no occurrence in any of the rules for the nonterminals V_i, 1 ≤ i ≤ 14n. 
This means that A_j can only occur on the right side of some nonterminal with a derivative that is not a factor of some 〈i〉_v and, since $|\mathfrak {D}(A_{j})|_{\star } \geq 1$, with Lemma 4, we can further conclude that A_j can only occur on the right side of some nonterminal with a derivative #〈i〉_v, 〈i〉_v# or #〈i〉_v#. The rules $\overset {{~}_{\leftarrow }}{V_{i}} \to \# V_{i}$ and ${\!}_{\rightarrow }{V_{i}} \to V_{i} \#$ have the derivatives #〈i〉_v and 〈i〉_v#, respectively, and their right sides do not contain A_j. Furthermore, if the right side of a nonterminal with derivative #〈i〉_v# contains A_j, we can replace it by $\overset {{~}_{\leftarrow }}{V_{i}} \#$ without increasing the size of the grammar. Consequently, we can assume that the nonterminal A_j is never used and therefore its rule can be removed. By repeating this argument, it follows that G contains all the rules R_v.

In a similar way, we can show that G contains all the rules R_◇ (in fact, the argument is simpler, since in this case, Lemma 4 together with the fact that A_j can only occur on the right side of some nonterminal with a derivative that is not a factor of some 〈i〉_◇ immediately implies that A_j does not occur on any right side).
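The recursive shape of the rules in R_v can be checked mechanically. The following Python sketch is only an illustration under assumptions that are not stated in this section: it takes M(i, 7) to be the residue of i modulo 7 chosen from {1, …, 7}, it takes f(i) to be the codeword that starts with x_{M(i,7)} and continues with f(h(i)), and it takes 〈i〉_v = f(i) ⋆ f(i)^R. Under these assumptions, expanding V_i via the rules r_{v,i} reproduces 〈i〉_v.

```python
def M(i, k):
    """Assumed residue of i modulo k, taken from {1, ..., k} (not {0, ..., k-1})."""
    return (i - 1) % k + 1


def h(i):
    """h(i) = (i - M(i, 7)) / 7, as in the statement of Lemma 7."""
    return (i - M(i, 7)) // 7


def f(i):
    """Assumed codeword of i over x_1..x_7, stored as a list of symbol indices."""
    return [M(i, 7)] if i <= 7 else [M(i, 7)] + f(h(i))


def expand_V(i):
    """Derivative of V_i under the rules r_{v,i}; '*' stands for the star symbol."""
    if i <= 7:
        return [i, "*", i]                      # V_i -> x_i * x_i
    first = f(i)[0]                             # f(i)[1] in the paper's 1-based notation
    return [first] + expand_V(h(i)) + [first]   # V_i -> f(i)[1] V_{h(i)} f(i)[1]


def codeword(i):
    """<i>_v = f(i) * f(i)^R under the assumptions stated above."""
    return f(i) + ["*"] + f(i)[::-1]
```

Every right side has length 3, which is why the rules of R_v contribute 3 · 14n to the size of the grammar; the rules of R_◇ behave in the same way with g and d_i in place of f and x_i.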

We now assume that $\mathsf {ax} = {\prod }^{6}_{i=1}(\beta _{i} \$_{i}) \beta _{7}$ is the axiom of G. In the same way as in the proofs of Lemmas 5 and 6, we can conclude that |β₁|≥ 196n and |β_ℓ|≥ 3n for 2 ≤ ℓ ≤ 5. Hence, replacing ax by $\mathsf {ax}^{\prime } = {\prod }^{6}_{i=1}(\beta ^{\prime }_{i} \$_{i}) \beta ^{\prime }_{7}$ with

$$ \begin{array}{@{}rcl@{}} &&\beta^{\prime}_{1} = \prod\limits_{j=0}^{6}\left( \prod\limits_{i=1}^{14n} (D_{i} V_{M(i+j,14n)}) \right) , \beta^{\prime}_{2} = \prod\limits_{i=1}^{n} \left( \overset{{~}_{\leftarrow}}{V_{i}} {\cent}_{1} D_{7i-1} \right) ,\\ &&\beta^{\prime}_{3} = \prod\limits_{i=1}^{n} \left( \overset{{~}_{\leftarrow}}{V_{i}} {\cent}_{2} D_{7i-2}\right) , {\kern35pt}\beta^{\prime}_{4} = \prod\limits_{i=1}^{n} \left( {\!}_{\rightarrow}{V_{i}} D_{7i-2} {\cent}_{1}\right) , \\ &&\beta^{\prime}_{5} = \prod\limits_{i=1}^{n} \left( {\!}_{\rightarrow}{V_{i}} D_{7i-1} {\cent}_{2}\right) , \end{array} $$

does not increase the size of the grammar. We now consider β₆, which produces the word $v_{6} = {\prod }_{i=1}^{n} \left (\# \langle 7i+C_{v}(i) \rangle _{v} \# \langle 7i \rangle _{\diamond }\right )$. We can conclude the following from Lemma 4. No two occurrences of ⋆ in v₆ can be produced by the same nonterminal; thus, |β₆|≥ 2n. Furthermore, the only factors that are repeated in $w_{\mathcal {G}}$ and that contain an occurrence of both ⋆ and # are factors of #〈7i + C_v(i)〉_v#. Hence, for every i, 1 ≤ i ≤ n, if the factor #〈7i + C_v(i)〉_v# in #〈7i + C_v(i)〉_v#〈7i〉_◇ is not produced by a single nonterminal, then there is an additional nonterminal in β₆ (i. e., in addition to the two nonterminals producing the two occurrences of ⋆ in #〈7i + C_v(i)〉_v#〈7i〉_◇). This implies that |β₆|≥ 3n − p, where p is the number of nonterminals with a derivative of #〈7i + C_v(i)〉_v#. This means that we can replace every such nonterminal and its rule by $\overset {{~}_{\leftrightarrow }}{V_{i}} \rightarrow \# {\!}_{\rightarrow }{V_{i}}$ without increasing the size of the grammar. Furthermore, again without increasing the size of the grammar, we can replace β₆ by ${\prod }_{i=1}^{n} \left (y_{i} D_{7i} \right )$, where, for every i, 1 ≤ i ≤ n, $y_{i} = \overset {{~}_{\leftrightarrow }}{V_{i}}$ if this nonterminal exists and $y_{i} = \overset {{~}_{\leftarrow }}{V_{i}} \#$ otherwise.

Next, we consider β₇, which produces the word

$$ \begin{array}{@{}rcl@{}} v_{7} = \prod\limits_{i=1}^{m-1}&(\!\!&\# \langle 7j_{2i-1}+C_{v}(j_{2i-1}) \rangle_{v} \# \langle 7j_{2i}+C_{v}(j_{2i}) \rangle_{v} \# \langle 7i+C_{e}(v_{j_{2i}},v_{j_{2i+1}}) \rangle_{\diamond} )\\ &&\# \langle 7j_{2m-1}+C_{v}(j_{2m-1}) \rangle_{v} \# \langle 7j_{2m}+C_{v}(j_{2m}) \rangle_{v} \# . \end{array} $$

Similarly to the word v₆, every occurrence of ⋆ in v₇ requires a distinct nonterminal and, in addition to that, also a distinct nonterminal for each factor #〈7i + C_v(i)〉_v# that is not completely produced by a single nonterminal. Hence, |β₇|≥ 4m − 1 − q, where q is the number of nonterminals $\overset {{~}_{\leftrightarrow }}{V_{i}}$ used in β₇. Consequently, we can also replace β₇ by ${\prod }_{i=1}^{m-1}(y_{i} D_{7i+C_{e}(v_{j_{2i}},v_{j_{2i+1}})}) y_{m}$, where, for every i, 1 ≤ i ≤ m, $y_{i} \in \{\overset {{~}_{\leftrightarrow }}{V}_{j_{2i-1}} {\!}_{\rightarrow }{V}_{j_{2i}}, \overset {{~}_{\leftarrow }}{V}_{j_{2i-1}} \overset {{~}_{\leftrightarrow }}{V}_{j_{2i}}\}$, if $\overset {{~}_{\leftrightarrow }}{V}_{j_{2i-1}}$ or $\overset {{~}_{\leftrightarrow }}{V}_{j_{2i}}$ exist, and $y_{i} = \overset {{~}_{\leftarrow }}{V}_{j_{2i-1}} \overset {{~}_{\leftarrow }}{V}_{j_{2i}}\#$, otherwise. We note that this does not increase the size of the grammar.

The grammar now has the form claimed in the statement of the lemma (note that all other rules not mentioned in the statement of the lemma can be ignored, since they are no longer used). □

Finally, we are able to conclude the proof of correctness by establishing the connection between the size of a smallest grammar for $w_{\mathcal {G}}$ and the size of a vertex cover for $\mathcal {G}$.

Lemma 8

The graph $\mathcal {G}$ has a vertex cover of size k if and only if $w_{\mathcal {G}}$ has a grammar of size 299n + k + 3m + 5.

Proof

Let Γ be a size-k vertex cover of $\mathcal {G}$. We construct the grammar described in Lemma 7 with respect to $\mathfrak {I} = \{i \mid v_{i} \in {\Gamma }\}$. Since Γ is a vertex cover, in the definition of β₇, we have $y_{i} \in \{\overset {{~}_{\leftrightarrow }}{V}_{j_{2i-1}} {\!}_{\rightarrow }{V}_{j_{2i}}, \overset {{~}_{\leftarrow }}{V}_{j_{2i-1}} \overset {{~}_{\leftrightarrow }}{V}_{j_{2i}}\}$, for every 1 ≤ i ≤ m. Consequently, by simply counting the symbols on the right sides of the rules, we conclude $|G| = 299n + |\mathfrak {I}| + 3m + 5 = 299n + k + 3m + 5$.

On the other hand, if there is a grammar of size 299n + k + 3m + 5 for $w_{\mathcal {G}}$, then, by Lemma 7, we can also assume that there exists a grammar G for $w_{\mathcal {G}}$ with $|G| = 299n + |\mathfrak {I}| + 3m + 5 \leq 299n + k + 3m + 5$ that has the form described in Lemma 7, with respect to some $\mathfrak {I} \subseteq \{1, 2, \ldots , n\}$. If, for some edge (v_i, v_j), $\{i, j\} \cap \mathfrak {I} = \emptyset $, then adding i to $\mathfrak {I}$ (and therefore the rule $\overset {{~}_{\leftrightarrow }}{V_{i}} \rightarrow \# {\!}_{\rightarrow }{V_{i}}$ to the grammar) does not increase the size of the grammar. This is due to the fact that the additional cost of 2 for introducing the rule is compensated by using $\overset {{~}_{\leftrightarrow }}{V_{i}}$ once in β₆ and once in β₇. Consequently, we can assume that ${\Gamma } = \{v_{i} \mid i \in \mathfrak {I}\}$ is a vertex cover. Since $|G| = 299n + |\mathfrak {I}| + 3m + 5 \leq 299n + k + 3m + 5$, this means that Γ is a vertex cover for $\mathcal {G}$ of size $|\mathfrak {I}| \leq k$. □
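The phrase "by simply counting the symbols on the right sides of the rules" can be made explicit. The following Python sketch adds up the contributions of the rule sets and of the axiom parts β₁, …, β₇ exactly as they are listed in Lemma 7; the function name and the optional parameter `covered_edges` are introduced here for illustration only.

```python
def lemma7_grammar_size(n, m, k, covered_edges=None):
    """Sum of right-side lengths of a grammar of the form given in Lemma 7.

    n: number of vertices, m: number of edges, k: size of the index set I.
    covered_edges: number of edges with an endpoint indexed by I
    (defaults to m, i.e. I corresponds to a vertex cover).
    """
    q = m if covered_edges is None else covered_edges
    rules = 3 * 14 * n            # R_diamond: 14n rules with right sides of length 3
    rules += 3 * 14 * n           # R_v: 14n rules with right sides of length 3
    rules += 2 * n                # rules for the nonterminals with derivative #<...>_v
    rules += 2 * n                # rules for the nonterminals with derivative <...>_v#
    rules += 2 * k                # rules for the nonterminals with derivative #<...>_v#
    beta1 = 7 * 14 * n * 2        # seven rounds of 14n pairs D_i V_...
    beta2_to_5 = 4 * 3 * n        # four parts with n blocks of three symbols each
    beta6 = 2 * k + 3 * (n - k)   # y_i D_{7i}: two symbols if i is in I, else three
    beta7 = 2 * q + 3 * (m - q) + (m - 1)   # y_i plus the m - 1 separators D_...
    separators = 6                # the symbols $_1, ..., $_6 in the axiom
    return rules + beta1 + beta2_to_5 + beta6 + beta7 + separators
```

With all edges covered, the total is 299n + k + 3m + 5; each uncovered edge costs one additional symbol in β₇, which is what the exchange argument in the second half of the proof exploits.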

From Lemma 8, we directly conclude our main result:

Theorem 3

SGP is NP-complete, even for alphabets of size 24.

Obviously, Theorem 3 leaves some room for improvement with respect to smaller alphabet sizes. In our reduction, we did use terminal symbols economically, but, for reasons explained next, this was not our main concern. While we generally believe that the alphabet size can be slightly reduced in our reduction, we consider it very unlikely that its current structure allows a substantial improvement in this regard (e. g., an alphabet size below 10). Thus, we did not further pursue this point, which we expect to lead to an even more involved reduction while only insignificantly decreasing the alphabet size. Consequently, the NP-hardness of the smallest grammar problem for small alphabets (with the most interesting candidates being 2 (i. e., binary strings) and 4 (due to the fact that DNA-sequences use a 4-letter alphabet)) remains open. Furthermore, we expect that completely new techniques are required for respective hardness reductions. In this regard, note that for alphabets of size 1, the smallest grammar problem is strongly connected to the problem of computing the smallest addition chain for a single integer; a problem that is neither known to be in P nor to be NP-hard (see [34] or Section 6 for details).
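To illustrate one direction of the connection to addition chains for unary alphabets, the following Python sketch turns an addition chain for n into a grammar for the word aⁿ in which every rule has a right side of length two. It is a minimal illustration, not the construction from [34]; the function names and the dictionary representation of rules are assumptions made here, and nothing is claimed about optimality.

```python
def grammar_from_addition_chain(chain):
    """Build rules for a^n from an addition chain 1 = c_0 < c_1 < ... < c_r = n.

    Every c_t with t >= 1 must be the sum of two earlier (possibly equal) elements.
    Returns a dict that maps each chain value c > 1 to a pair (p, q) with p + q = c,
    read as the rule A_c -> A_p A_q, where A_1 stands for the terminal a.
    """
    if chain[0] != 1:
        raise ValueError("an addition chain starts with 1")
    rules = {}
    seen = [1]
    for c in chain[1:]:
        pair = next(((p, c - p) for p in seen if c - p in seen), None)
        if pair is None:
            raise ValueError(f"{c} is not a sum of two earlier chain elements")
        rules[c] = pair
        seen.append(c)
    return rules


def derived_length(c, rules):
    """Length of the unary word derived from A_c."""
    return 1 if c == 1 else sum(derived_length(p, rules) for p in rules[c])


def grammar_size(rules):
    """Size of the grammar: the sum of the right-side lengths."""
    return sum(len(rhs) for rhs in rules.values())
```

For example, the chain 1, 2, 3, 6, 12, 15 yields five rules, so a¹⁵ has a grammar of size 10, compared with 15 for the trivial grammar that consists of the start rule alone.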

3.3 Extensions of the Reductions

In this section, we conclude several important hardness results by slight modifications of the reduction presented in Section 3.2. First, we show that the optimisation variant of the smallest grammar problem (over fixed alphabets) is APX-hard and therefore it does not allow for a polynomial-time approximation scheme, unless P = NP. Just like Theorem 3 lifts the known NP-hardness of the smallest grammar problem for unbounded alphabets to the practically relevant case of fixed alphabets, this APX-hardness result lifts the inapproximability result for unbounded alphabets of [33, 34] to the fixed alphabet case. There is one caveat, though, which is that the corresponding constant lower bound on the approximation ratio is much lower than the already low 1.0001 achieved for unbounded alphabets; thus, we do not bother to actually compute it, and we consider the value of the APX-hardness result to be that it rules out the existence of a PTAS.

Theorem 4

SGP_opt is APX-hard, even for alphabets of size 24.

Proof

The reduction used for Theorem 3 can also be seen as an L-reduction from the optimisation variant of the minimum vertex cover problem restricted to cubic graphs (each vertex has degree 3), which remains APX-hard (see [52]). More precisely, this problem is denoted by (I_VC, S_VC, m_VC), where I_VC is the set of undirected cubic graphs, $S_{\textsc {VC}}(\mathcal {G}) = \{C \mid C \text { is a vertex cover for } \mathcal {G}\}$ and $m_{\textsc {VC}}(\mathcal {G}, C) = |C|$; we denote SGP_opt by (I_SGP, S_SGP, m_SGP).

Next, we describe an L-reduction from the problem (I_VC, S_VC, m_VC) to the problem (I_SGP, S_SGP, m_SGP). The above described translation of a graph $\mathcal {G}$ to the word $w_{\mathcal {G}}$ (i. e., the one defined in Section 3.2 in order to prove Theorem 3) gives the function f for the L-reduction. The function g, that maps $\mathcal {G} \in I_{\textsc {VC}}$ and a grammar $G \in S_{\textsc {SGP}}(f(\mathcal {G}))$ to a vertex cover $C \in S_{\textsc {VC}}(\mathcal {G})$ works as follows. We first build a grammar $G^{\prime }$ with $|G^{\prime }|\leq |G|$ which is of the form described in Lemma 7; observe that all transformations that are necessary to reach this kind of normal form are constructive and computable in polynomial time. Then $g(\mathcal {G}, G) = \{v_{i} \mid i \in \mathfrak {I}\}$, which is a vertex cover for $\mathcal {G}$ by Lemma 8 (note that the set $\mathfrak {I}$ is ensured by Lemma 7). Finally, we show that choosing β = 613 and γ = 1 satisfies the inequalities. To this end, we first note that, for any cubic graph $\mathcal {G}$ with n vertices and m edges, we have $m=\frac {3}{2}n$ (since each vertex has degree 3) and $m_{\textsc {VC}}^{*}(\mathcal {G}) \geq \frac {n}{2}$ (since each vertex can cover at most three edges), and $m_{\textsc {VC}}^{*}(\mathcal {G}) \geq 1$.

$$ \begin{array}{@{}rcl@{}} m_{\textsc{SGP}}^{*}(w_{\mathcal{G}}) &=& 299n + 3m + 5 + m_{\textsc{VC}}^{*}(\mathcal{G})\\ &=& 607 \cdot \frac{n}{2} + 5 + m_{\textsc{VC}}^{*}(\mathcal{G}) \\ &\leq& 607 \cdot m_{\textsc{VC}}^{*}(\mathcal{G}) + 5 + m_{\textsc{VC}}^{*}(\mathcal{G}) \\ &\leq& 613 \cdot m_{\textsc{VC}}^{*}(\mathcal{G}) = \beta \cdot m_{\textsc{VC}}^{*}(\mathcal{G}) , \end{array} $$

for any $\mathcal {G}\in I_{\textsc {VC}}$. Furthermore,

$$ \begin{array}{@{}rcl@{}} m_{\textsc{VC}}(\mathcal{G},g(\mathcal{G},G)) - m_{\textsc{VC}}^{*}(\mathcal{G}) &= &(299n + 3m + 5 + m_{\textsc{VC}}(\mathcal{G},g(\mathcal{G},G))) - \\ &&(299n + 3m + 5 + m_{\textsc{VC}}^{*}(\mathcal{G}))\\ &= &1\cdot (m_{\textsc{SGP}}(w_{\mathcal{G}},G) - m_{\textsc{SGP}}^{*}(w_{\mathcal{G}})) , \end{array} $$

for any $\mathcal {G}\in I_{\textsc {VC}}$ and $G\in S_{\textsc {SGP}}(w_{\mathcal {G}})$. □
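The first inequality of the L-reduction can also be checked numerically. The following Python sketch evaluates the optimum from Lemma 8 for cubic graphs (where n is even and m = 3n/2) and tests it against β = 613 for every vertex cover size that is compatible with the bounds used in the proof; the function names are introduced here for illustration.

```python
def sgp_optimum(n, vc_opt):
    """Size of a smallest grammar for w_G by Lemma 8, for a cubic graph on n vertices."""
    if n % 2 != 0:
        raise ValueError("a cubic graph has an even number of vertices")
    m = 3 * n // 2                      # each vertex has degree 3
    return 299 * n + 3 * m + 5 + vc_opt


def beta_bound_holds(n, vc_opt, beta=613):
    """Check m*_SGP(w_G) <= beta * m*_VC(G)."""
    return sgp_optimum(n, vc_opt) <= beta * vc_opt


def check_all(max_n=200):
    """Test all even n up to max_n and all cover sizes between n/2 and n."""
    return all(
        beta_bound_holds(n, vc)
        for n in range(4, max_n + 1, 2)
        for vc in range(max(1, n // 2), n + 1)
    )
```

The check relies only on the two lower bounds stated in the proof, m*_VC ≥ n/2 and m*_VC ≥ 1, so it also covers values of the cover size that no actual cubic graph attains.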

Next, we take a closer look at the rule-size measure of grammars, i. e., at the problems SGP_r and 1-SGP_r. As defined in Section 2.2, the rule-size also takes the number of rules into account. In fact, the literature on grammar-based compression is inconsistent with respect to which kind of size is used, e. g., in [6, 8, 15, 33, 34, 41], the size of a grammar coincides with our definition |⋅|, while in [4, 53, 54, 55], the rule-size is used. The rule-size seems to be mainly motivated by the question of how a grammar is encoded as a single string, which, in any reasonable way, requires an additional symbol per rule.^{Footnote 9} In many contexts, the difference between size and rule-size of grammars seems negligible, but, formally, the problems SGP and SGP_r (as well as 1-SGP and 1-SGP_r) are different decision problems and hardness results do not automatically carry over from one to the other. Since the existing literature suggests that the rule-size is of interest as well, we consider it a worthwhile task to extend our hardness results accordingly.

It seems intuitively clear that the size increase caused by measuring with the rule-size does not have an impact on the complexity of the smallest grammar problem. In fact, the arguments in the proof for Theorem 2 for the 1-level case also apply for the rule-size, but with an addition of 2n + k + 2 (i. e., the number of rules) to the size of an r-smallest grammar. This is due to the fact that the rules that are introduced in the proof of Lemma 3 also shorten the grammar with respect to the rule-size measure.

Theorem 5

1-SGP_r is NP-complete, even for alphabets of size 5.

In the multi-level case, however, the situation is not so simple. In particular, in the proof of Theorem 3, there are some arguments that do not apply to the rule-size. For example, a rule which only compresses a factor of length two is only profitable (with respect to the rule-size) if it can be used at least three times, which is problematic, since the rules which correspond to the vertex cover have length two and, in case the vertex only covers one edge, compress factors which only occur twice. Besides these problems, already in Lemma 5, we can see that it is hard to prove that the rule-size of the desired grammar $G^{\prime \prime }$ is smaller than |G|_r, as we now have to pay a cost of 4 for each rule V_i (or D_i) with $i\notin \mathfrak {I}_{v}$ (or $i\notin \mathfrak {I}_{\diamond }$), which cannot be compensated by shortening the axiom for u only by $7\lceil \frac {|\overline {\mathfrak {I}_{\diamond }}|+|\overline {\mathfrak {I}_{v}}|}{2}\rceil $.
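The threshold of three uses follows from a short calculation, sketched below in Python. The function is an illustration introduced here and assumes that the rule replaces non-overlapping occurrences of its right side, each by a single nonterminal; a result of zero means that the rule breaks even.

```python
def profit(rhs_len, occurrences, rule_size=False):
    """Reduction in grammar size obtained by adding a rule A -> alpha.

    rhs_len: the length |alpha| of the right side.
    occurrences: number of non-overlapping occurrences of alpha that are replaced by A.
    rule_size: if True, each rule costs one extra unit (rule-size measure).
    """
    saved = occurrences * (rhs_len - 1)        # each occurrence shrinks to one symbol
    cost = rhs_len + (1 if rule_size else 0)   # right side, plus 1 under the rule-size
    return saved - cost


# A rule with a right side of length two that is used twice:
size_profit = profit(2, 2)                       # breaks even under the size measure
rule_size_profit = profit(2, 2, rule_size=True)  # negative under the rule-size measure
```

Under the size measure, such a rule is harmless as soon as it is used twice, whereas under the rule-size it needs three uses to break even, which is why the modified reduction repeats the relevant factors once more.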

With a larger alphabet and certain repetitions of subwords of $w_{\mathcal {G}}$, we can modify the reduction to accommodate the rule-size, such that the arguments used for Theorem 3 still hold for this measure. To this end, we now encode 〈i〉_v and 〈i〉_◇ over 8-ary instead of 7-ary alphabets $\{x_{1},\dots ,x_{8}\}$ and $\{d_{1},\dots ,d_{8}\}$, respectively, with analogous functions f and g. Let $v^{\prime }$ and $w^{\prime }$ be defined as v and w on page 23, but with respect to the new 8-ary codewords which only means that each occurrence of ‘7’ in the definition of v and w is replaced by ‘8’. Moreover, let $u^{\prime }$ be defined as u on page 23, but with the ‘6’ of the first product replaced by ‘7’ and the ‘14n’ of the second product replaced by ‘24n + 4’ (the latter is necessary, since we need more separators of the form 〈i〉_◇). The colourings C_v and C_e remain unchanged.

In order to adapt the reduction to the rule-size measure, we have to repeat each factor #〈8i + C_v(i)〉_v and each factor 〈8i + C_v(i)〉_v# once more, but in such a way that Proposition 2 still holds, which is done by using three new symbols $₇, $₈ and ¢₃, and to add the following to $v^{\prime }$:

$$ \begin{array}{@{}rcl@{}} v^{\prime\prime} = v^{\prime}& & \prod\limits_{i=1}^{n} \left( \langle 8i+C_{v}(i) \rangle_{v} \# \langle 8i-3 \rangle_{\diamond} {\cent}_{3}\right) {\$_{7}}\\ &&\prod\limits_{i=1}^{n}\left( \# \langle 8i+C_{v}(i) \rangle_{v} {\cent}_{3} \langle 8i-3 \rangle_{\diamond}\right) {\$_{8}} . \end{array} $$

In order to also repeat once more the factors #〈8j_{2i} + C_v(j_{2i})〉_v# to make covering edges profitable with respect to the rule-size, we repeat the complete list of edges, but every edge $(v_{j_{2i-1}}, v_{j_{2i}})$ is represented in reverse order as #〈8j_{2i} + C_v(j_{2i})〉_v#〈8j_{2i−1} + C_v(j_{2i−1})〉_v# to make sure that no subword of the form 〈i〉_v#x_j or x_j#〈i〉_v is repeated. We further choose a new, previously not used set of separators 〈i〉_◇ (actually the 2m + 4 more for which we created codewords with u) to make sure that each factor of the form 〈i〉_◇# or #〈i〉_◇ occurs at most once. We still choose the separators according to the edge-colouring to make sure that no factors of the form 〈i〉_v#d_j or d_j#〈i〉_v are repeated; observe that by repeating the edges in reverse order, a factor of the form 〈i〉_v#d_j in $w^{\prime }$ becomes a factor of the form d_j#〈i〉_v in the reverse listing. Formally, we define:

$$ \begin{array}{@{}rcl@{}} w^{\prime\prime} = w^{\prime} \tilde{w} \# \langle 8j_{2}+C_{v}(j_{2}) \rangle_{v} \# \langle 8j_{1}+C_{v}(j_{1}) \rangle_{v}\# , \end{array} $$

where

$$ \begin{array}{@{}rcl@{}} \tilde{w} = \prod\limits_{i=m}^{2} (& \# \langle 8j_{2i}+C_{v}(j_{2i}) \rangle_{v} \# \langle 8j_{2i-1}+C_{v}(j_{2i-1}) \rangle_{v} \\ &\# \langle 8(i+m)+C_{e}(v_{j_{2(i-1)}},v_{j_{2i-1}}) \rangle_{\diamond}) . \end{array} $$

Finally, we set $w^{\prime }_{\mathcal {G}} = u^{\prime }v^{\prime \prime }w^{\prime \prime }$.

It can be easily verified that Lemma 4 remains true for the new construction; observe that appending the new part of $w^{\prime \prime }$ yields the only occurrence of the factor ## (note that $w^{\prime }$ ends with #) which implies that the old and the new part are separated in the axiom of any r-smallest grammar for $w^{\prime }_{\mathcal {G}}$. The equivalent to Lemma 5 also holds, since the part of the axiom for $u^{\prime }$ now has a length of at least $384n+64+8\lceil \frac {|\overline {\mathfrak {I}_{\diamond }}|+|\overline {\mathfrak {I}_{v}}|}{2}\rceil +1$ and the set of new rules, which now costs $4(|\overline {\mathfrak {I}_{\diamond }}|+|\overline {\mathfrak {I}_{v}}|)$, shortens this to 384n + 65 (i. e., the number of occurrences of ⋆ in $u^{\prime }$ plus 1 for $₁). Lemma 6 follows with the same arguments as before, just with 3 occurrences for each #〈i〉_v and 〈i〉_v#, which makes the rules for these subwords profitable even with respect to the rule-size. An analogue of Lemma 7 then follows exactly as before (the only addition is that the new parts of $v^{\prime \prime }$ and $w^{\prime \prime }$ are compressed in the obvious way by the existing rules). The following observation shall be helpful.

Observation 3

If $\mathfrak {I} \subseteq \{1, 2, \ldots , n\}$ is such that $\{v_{i} \mid i \in \mathfrak {I}\}$ is a vertex cover, then the grammar for $w^{\prime }_{\mathcal {G}}$ according to the adapted version of Lemma 7 with respect to $\mathfrak {I}$ (see the proof of Lemma 8) satisfies $|G| = 553n + |\mathfrak {I}| + 6m + 94$ and $|G|_{\mathsf {r}} = 603n + 2|\mathfrak {I}| + 6m + 103$ (note that for the rule-size, we also have to count the start rule, so the sizes differ by the number of rules which is 50n + k + 9).
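The relation between the two measures in Observation 3 can be verified with a few lines of Python. The breakdown of the rule count is an assumption inferred from the modified construction (24n + 4 rules each for the nonterminals D_i and V_i, n rules each for the two kinds of one-sided nonterminals, one rule per index in 𝔍, here written k, and the start rule); only the totals are stated in the text.

```python
def num_rules_adapted(n, k):
    """Number of rules of the adapted Lemma 7 grammar (assumed breakdown)."""
    d_rules = 24 * n + 4      # one rule per codeword <i>_diamond
    v_rules = 24 * n + 4      # one rule per codeword <i>_v
    left_rules = n            # nonterminals with derivative #<...>_v
    right_rules = n           # nonterminals with derivative <...>_v#
    cover_rules = k           # nonterminals with derivative #<...>_v#
    start_rule = 1            # the axiom counts as a rule for the rule-size
    return d_rules + v_rules + left_rules + right_rules + cover_rules + start_rule


def size_adapted(n, m, k):
    """|G| as stated in Observation 3."""
    return 553 * n + k + 6 * m + 94


def rule_size_adapted(n, m, k):
    """|G|_r = |G| + number of rules."""
    return size_adapted(n, m, k) + num_rules_adapted(n, k)
```

Under this breakdown, the number of rules is 50n + k + 9 and the rule-size evaluates to 603n + 2k + 6m + 103, in agreement with the observation.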

An analogous statement of Lemma 8 can now be concluded as follows. For a size-k vertex cover Γ of $\mathcal {G}$, we set $\mathfrak {I} = \{i \mid v_{i} \in {\Gamma }\}$ and then construct a grammar G for $w^{\prime }_{\mathcal {G}}$ according to the adapted version of Lemma 7 with respect to $\mathfrak {I}$ with $|G|_{\mathsf {r}} = 603n + 2|\mathfrak {I}| + 6m + 103$ (see Observation 3). On the other hand, if there is a grammar for $w^{\prime }_{\mathcal {G}}$ of rule-size 603n + 2k + 6m + 103, then, by the adapted version of Lemma 7, there is a grammar G for $w^{\prime }_{\mathcal {G}}$ with $|G|_{\mathsf {r}} = 603n + 2|\mathfrak {I}| + 6m + 103 \leq 603n + 2k + 6m + 103$ that has the form given by the adapted version of Lemma 7, with respect to some $\mathfrak {I} \subseteq \{1, 2, \ldots , n\}$. If, for some edge (v_i, v_j), $\{v_{i}, v_{j}\} \cap \mathfrak {I} = \emptyset $, then the factors #〈8i + C_v(i)〉_v#〈8j + C_v(j)〉_v# and #〈8j + C_v(j)〉_v#〈8i + C_v(i)〉_v# in $w^{\prime \prime }$ each correspond to three symbols in the axiom, and the factor #〈8i + C_v(i)〉_v# in $v^{\prime \prime }$ corresponds to two symbols in the axiom. Hence, introducing the rule $\overset {{~}_{\leftrightarrow }}{V_{i}}\rightarrow \#{\!}_{\rightarrow }{V_{i}}$ has a cost of three with respect to the rule-size and shortens the axiom by at least three. Consequently, as in the proof of Lemma 8, we can assume that ${\Gamma } = \{v_{i} \mid i \in \mathfrak {I}\}$ is a vertex cover. Since $|G|_{\mathsf {r}} = 603n + 2|\mathfrak {I}| + 6m + 103 \leq 603n + 2k + 6m + 103$, this means that Γ is a vertex cover for $\mathcal {G}$ of size at most k. Thus, we conclude that the graph $\mathcal {G}$ has a vertex cover of size k if and only if there exists a grammar of rule-size 603n + 2k + 6m + 103 for $w^{\prime }_{\mathcal {G}}$, which yields the following:

Theorem 6

SGP_r is NP-complete, even for alphabets of size 29.

Similar to Theorem 4, the above reduction can also be seen as an L-reduction (with the only change of setting β = 1329), which shows that the optimisation variant of the smallest grammar problem remains APX-hard under the rule-size measure.

Theorem 7

SGP_{r, opt} is APX-hard, even for alphabets of size 29.
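The constant β = 1329 can be recovered by the same calculation as in the proof of Theorem 4. The following is a reconstruction from Observation 3 rather than a derivation given in the text; it writes $m_{\mathsf {r}}^{*}$ (a label introduced here) for the rule-size of an r-smallest grammar and uses $m=\frac {3}{2}n$, $m_{\textsc {VC}}^{*}(\mathcal {G}) \geq \frac {n}{2}$ and $m_{\textsc {VC}}^{*}(\mathcal {G}) \geq 1$ for cubic graphs:

$$ \begin{array}{@{}rcl@{}} m_{\mathsf{r}}^{*}(w^{\prime}_{\mathcal{G}}) &=& 603n + 6m + 103 + 2 m_{\textsc{VC}}^{*}(\mathcal{G})\\ &=& 1224 \cdot \frac{n}{2} + 103 + 2 m_{\textsc{VC}}^{*}(\mathcal{G}) \\ &\leq& 1224 \cdot m_{\textsc{VC}}^{*}(\mathcal{G}) + 103 \cdot m_{\textsc{VC}}^{*}(\mathcal{G}) + 2 m_{\textsc{VC}}^{*}(\mathcal{G}) \\ &=& 1329 \cdot m_{\textsc{VC}}^{*}(\mathcal{G}) . \end{array} $$

Since a vertex cover that is larger by one increases the rule-size by two, the second inequality of the L-reduction still holds with γ = 1.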

We conclude that if we change from the normal size measure to the rule-size measure, NP- and APX-hardness of the smallest grammar problem over fixed alphabets remains, although the smallest alphabet size in our constructions is slightly larger. We conclude this section by another interesting observation that follows from the rule-size variant of our reduction.

Obviously, the modified reduction to SGP_r can also be interpreted as a reduction to SGP. While, on first glance, this only seems to yield a weaker hardness result compared to the one of Theorem 3, it has a nice feature that entails an interesting result in its own right. More precisely, with respect to the modified reduction and the normal size measure, every rule from Lemma 7 has a positive profit (i.e., replacing all occurrences of the nonterminal by the right side of the rule would increase the overall size) and, furthermore, every rule added in the proofs of Lemmas 5 and 6 yields a strictly smaller grammar (note that this directly follows from the correctness of the construction for the rule-size measure). Moreover, there are no repeated substrings in the grammar with this set of rules which means that no additional rules with nonnegative profit can be added. Consequently, we have not only determined the size of a smallest (with respect to |⋅|) grammar G for $w^{\prime }_{\mathcal {G}}$ to be 553n + k + 6m + 94, where k is the size of a smallest vertex cover for $\mathcal {G}$ (see Observation 3), but also that G requires exactly |G|_r −|G| = 50n + k + 9 rules (or nonterminals). Hence, the modified reduction also serves as a reduction from the vertex cover problem to the following (weaker) variant of the smallest grammar problem:

Rule Number-SGP (RN-SGP)
Instance: A word w and a $k \in \mathbb {N}$.
Question: Does there exist a smallest grammar G = (N,Σ, R, S) for w with |N|≤ k?

Theorem 8

RN-SGP is NP-hard, even for alphabets of size 29.

For the 1-level case, the original reduction already provides the analogous result (here, 1-RN-SGP denotes the variant of RN-SGP, where we ask whether there is a smallest 1-level grammar with |N|≤ k):

Theorem 9

1-RN-SGP is NP-hard, even for alphabets of size 5.

While the problems RN-SGP and 1-RN-SGP naturally arise in the context of grammar-based compression, they are particularly interesting in the light of the results presented in Section 4.1 and their relevance shall be discussed there in more detail.

3.4 (Limits of) Alphabet Reduction

As shall be discussed in this section, we can achieve a slight reduction of the alphabet size in Theorem 3. However, it seems rather unlikely that a substantial decrease is possible with our current general approach. In particular, it is suggested that a different approach is needed to prove the hardness of SGP for small, e. g., binary, alphabets.

We first note that we already saved one further unique separator of the form $_i in the construction for the rule-size by using ## instead, simply exploiting the fact that this substring of length two is not repeated anywhere else, which makes a rule containing it impossible in a smallest grammar. We can actually also shrink our alphabet in the construction used to prove Theorem 3 by saving separator symbols, more precisely, by only using one symbol $ instead of $\$_{1},\dots ,\$_{6}$. Recall that $\$_{1},\dots ,\$_{6}$ only had the purpose to cut the grammar at these symbols as described in Observation 2 and hence avoid unwanted repetitions.

As a first observation, it is not hard to see that $₂, $₄, $₅ can be removed from $w_{\mathcal {G}}$ without creating unwanted repetitions. Removing $₂ only creates the two unwanted (in the sense that those should not repeat by Propositions 1 and 2) substrings ⋆ g(7n − 1)# and d₁#f(C_v(1)) ⋆, which do not occur elsewhere in $w_{\mathcal {G}}$ (more precisely for the second substring: y#f(C_v(1)) ⋆ with $y\notin \{x_{1},\dots ,x_{7}\}$ occurs only two other times, once with y = $₁ and, after removal of $₅, once with y = ¢₂). Similar arguments hold for removing $₄ and $₅. The remaining $_i occur in the subwords: $x_{6}\$_{1}\#x_{C_{v}(1)}$, $d_{5}\$_{3}x_{C_{v}(1)}$, $d_{7}\$_{6}\#x_{C_{v}(j_{1})}$. Now consider replacing $₁, $₃, $₆ each by the same symbol $. If we make sure to list the edges in an order such that C_v(1)≠C_v(j₁), the only repeating factor of length more than one containing this new symbol $ is $#. As this subword of length two only occurs twice, it is not profitable for a smallest grammar to compress it with a rule. So with the small adjustments of deleting $₂, $₄, $₅, possibly picking another order to list the edges and replacing $₁, $₃, $₆ by $, we need five fewer symbols for our reduction.

Further reduction of the alphabet size requires much more effort. Our main kind of argument is that certain rules cannot exist, simply because their derivative does not occur more than once in $w_{\mathcal {G}}$. There are cases, where it is possible to show that certain rules with a repeated derivative do not occur, but the respective argument cannot be local and would rather depend on the structure of the whole grammar. On the other hand, rules that we want fixed in a smallest grammar have to be provably profitable. With these properties in mind, it is quite obvious that there is not much room to reduce the alphabet size further.

The symbols ⋆, #,¢₁,¢₂ and, after applying the replacement above, $ each have a very specific purpose. It seems very difficult to reduce the alphabet by replacing one of those characters by another or some codeword.

For the symbols $x_{1},\dots ,x_{7},d_{1},\dots ,d_{7}$, we see that in Lemma 5, which fixes the codewords for vertices and separators built from these symbols, we require at least six repetitions of each desired codeword. Doing this without repeating unwanted subwords means that, at least with the idea we used to repeat these codewords in the alternating fashion given by the subword u, we need at least six different symbols in each encoding. For the separators 〈j〉_◇, our construction requires the seven different symbols $d_{1},\dots ,d_{7}$, to have unique separators between the repetitions of the subwords #〈i〉_v, 〈i〉_v# and #〈i〉_v# in v and between the edges in the listing in w, for which we need four different kinds of separators, one for each colour of the edge-colouring C_e. For the vertex codewords 〈i〉_v, we also need seven different symbols to represent the vertex colouring C_v. So, first of all, the only way to save symbols among $x_{1},\dots ,x_{7},d_{1},\dots ,d_{7}$ seems to be to modify the input graph in such a way that the colourings C_e and C_v require fewer colours. It is possible to do this with the adjustments described in the following.

Given a subcubic graph $\mathcal {G}=(V,E)$, we first build the graph $\bar {\mathcal {G}}$ from $\mathcal {G}$ by subdividing each edge twice, i.e., we replace each edge (u, v) ∈ E by three edges (u, u_v),(u_v, v_u) and (v_u, v), where u_v and v_u are two new vertices which are not incident to further edges. We now construct the word for SGP to represent the graph $\bar {\mathcal {G}}$. This shift to the graph $\bar {\mathcal {G}}$ can be used to decrease the number of colours we require both for C_v and C_e. First observe that the graph $\bar {\mathcal {G}}^{2}$ (i. e., the graph obtained from $\bar {\mathcal {G}}$ by the same operation used to obtain $\mathcal {G}^{2}$ from $\mathcal {G}$ in the original reduction; see page 23) has maximum degree three, as a vertex v ∈ V is adjacent to the at most three vertices in {u_v: (u, v) ∈ E}, and a vertex v_u, added by the subdivision process for an edge (u, v), is adjacent to u and possibly the at most two vertices in {v_x: (v, x) ∈ E, x≠u}. The vertex colouring C_v hence only needs four different colours to properly colour $\bar {\mathcal {G}}^{2}$.

Next, we choose a specific listing of the edges of $\bar {\mathcal {G}}$ such that the three edges of $\bar {\mathcal {G}}$ corresponding to an edge (u, v) of $\mathcal {G}$ are consecutively listed as (u_v, u),(v_u, u_v),(v, v_u) (and the relative order of such triples is arbitrary). In this way, the multi-graph $\bar {\mathcal {G}}^{\prime }$ (i. e., the graph obtained from $\bar {\mathcal {G}}$ by the same operation used to obtain $\mathcal {G}^{\prime }$ from $\mathcal {G}$ in the original reduction; see page 23) contains the edges {(u, v_u),(u_v, v): (u, v) ∈ E} for vertices from V and, in addition, we have at most one edge of the form $(u_{v}, u^{\prime }_{v^{\prime }})$ for each new vertex added by the subdivision. This means that in $\bar {\mathcal {G}}^{\prime }$, a vertex v ∈ V is only adjacent to the at most three vertices in {u_v: (u, v) ∈ E}, and a vertex u_v added by the subdivision process for the edge (u, v) is incident to one edge connected to v and to at most one other edge connected to a vertex added by the subdivision process different from u_v. Consequently, $\bar {\mathcal {G}}^{\prime }$ is a simple graph of maximum degree three. Further, observe that the vertices of degree three in $\bar {\mathcal {G}}^{\prime }$ (which are a subset of the vertices in V ) form an independent set in $\bar {\mathcal {G}}^{\prime }$. By a theorem of Fournier [56], an edge-colouring for a graph with these properties only requires three colours and can be computed in polynomial time with Vizing's algorithm [57]. With the same arguments used to prove Theorem 3, it follows that a smallest grammar encodes a minimum vertex cover for $\bar {\mathcal {G}}$. It remains to observe that the size of a minimum vertex cover for the original input graph ${\mathcal {G}}$ can be derived from a minimum vertex cover for $\bar {\mathcal {G}}$.
If $\mathcal {G}$ has a vertex cover of size k, then this can be extended to a vertex cover of size k + |E| for $\bar {\mathcal {G}}$ by adding exactly one of u_v and v_u for each edge (u, v) of $\mathcal {G}$. On the other hand, it can be easily seen that, without loss of generality, a minimum vertex cover for $\bar {\mathcal {G}}$ contains exactly one of u_v and v_u for each edge (u, v) of $\mathcal {G}$, and, moreover, the remaining k vertices in the vertex cover for $\bar {\mathcal {G}}$ must be a vertex cover for the graph $\mathcal {G}$.
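The relation between the vertex covers of $\mathcal {G}$ and $\bar {\mathcal {G}}$ can be checked experimentally on small graphs. The following is a minimal sketch, not part of the reduction itself: it subdivides every edge twice and compares brute-force minimum vertex covers. The function names and the tuple labels `("sub", u, v)` standing for u_v are illustrative assumptions, and the exhaustive search is only feasible for very small inputs.

```python
from itertools import combinations

def min_vertex_cover(vertices, edges):
    """Size of a minimum vertex cover, found by exhaustive search."""
    vertices = list(vertices)
    for k in range(len(vertices) + 1):
        for cand in combinations(vertices, k):
            chosen = set(cand)
            if all(u in chosen or v in chosen for u, v in edges):
                return k

def subdivide_twice(vertices, edges):
    """Replace each edge (u, v) by the path u - u_v - v_u - v."""
    new_vertices = list(vertices)
    new_edges = []
    for u, v in edges:
        uv, vu = ("sub", u, v), ("sub", v, u)  # stand-ins for u_v and v_u
        new_vertices += [uv, vu]
        new_edges += [(u, uv), (uv, vu), (vu, v)]
    return new_vertices, new_edges
```

For instance, applying `subdivide_twice` to a triangle yields a cycle on nine vertices, and the two cover sizes differ by exactly the number of edges of the triangle, as the argument above predicts.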

Overall, the adjustments described so far lead to a hardness reduction which only uses an alphabet with 17 symbols, as we now only require a 6-ary encoding for vertices and separators. Observe that, although the colouring C_v only requires four colours now, we cannot reduce the alphabet for the vertices to be less than six, as we need six different symbols for the repetitions in u.

Corollary 1

SGP is NP-complete, even for alphabets of size 17.

The reduction sketched above can still be seen as an L-reduction from the optimisation version of vertex cover to SGP_opt. To see this, observe that the adjustments made to reduce the alphabet only cause an addition of $\mathcal {O}(m)$ to the size of a smallest grammar for the word constructed for the input graph $\mathcal {G}$. As $\mathcal {O}(m)\subseteq \mathcal {O}(m^{*}_{VC}(\mathcal {G}))$ (recall that $\mathcal {G}$ is cubic), the size of the smallest grammar can be linearly bounded by $m^{*}_{VC}(\mathcal {G})$ in a similar way as shown in the proof of Theorem 4.

Corollary 2

SGP_opt is APX-hard, even for alphabets of size 17.

The only way to further reduce the alphabet would be to not just use the repetitions in u to prove Lemma 5 but the repetitions in the whole word. This, however, is very difficult, as including the rules we want to fix can no longer easily be shown to shorten the axiom. If there is no nonterminal V_i which derives 〈i〉_v for some index i, the larger substring #〈i〉_v¢₁ in v, for example, might still only require three symbols in the axiom by compressing parts of 〈i〉_v with # or ¢₁. The same holds for all occurrences of the substring 〈i〉_v in v or w. This problem is actually the reason why we need the nonterminals V_i and D_i fixed for Lemma 6: they make our desired rules derive 〈i〉_v# and #〈i〉_v in the cheapest possible way, which enables the argument that other unwanted rules in N_ax cannot be more profitable. Consequently, an alphabet of size 17 seems to be necessary to cleanly prove Theorem 3 with our construction.

Similar ideas and limits for alphabet reduction hold for the rule-size measure. A reduction that only uses $ instead of $\$_{1},\dots , \$_{8}$ works analogously. The symbols $_i with i ∈{2,4,5,6,7} can be deleted without creating repetitions of unwanted subwords. Replacing the remaining $_i, i ∈{1,3,8}, by $ and again reordering the edges in the listing given in $w^{\prime \prime }$ such that $x_{C_{v}(1)}\not =x_{C_{v}(j_{1})}$ makes sure that the only repeating factor of length more than one containing the new symbol $ is $#. This factor occurs exactly twice and is hence not compressed by a rule in a smallest grammar (observe that with the rule-size as measure, such a rule is not just unprofitable but even makes the grammar larger). As we here require eight repetitions to show the equivalent of Lemma 5 for the rule-size, saving symbols among $x_{1},\dots ,x_{8},d_{1},\dots ,d_{8}$ is not possible. Consequently, Theorems 6 and 7 can be improved to require only an alphabet of size 22, but a reduction with a smaller alphabet will be very difficult with our construction.

4 Smallest Grammars with a Bounded Number of Nonterminals

A natural follow-up question to the hardness for fixed alphabets is whether polynomial-time solvability is possible if instead the cardinality of the nonterminal alphabet N (or, equivalently, the number of rules) is bounded. In this section, we answer this question in the affirmative by representing words w ∈Σ^∗ as graphs Φ_m(w) and Φ₁(w), such that smallest independent dominating sets of these graphs correspond to smallest grammars and smallest 1-level grammars, respectively, for w.

It will be more convenient to first take care of the simpler 1-level case and then to treat the multi-level case as an extension of it, i. e., we first define Φ₁(w) and then derive Φ_m(w) from Φ₁(w). Recall that, as defined in Section 2, F_≥ 2(w) is the set of factors of w with size at least 2. Let Φ₁(w) = (V, E) be defined by V = V₁ ∪ V₂ ∪ V₃ and E = E₁ ∪ E₂ ∪ E₃, where:

$$ \begin{array}{@{}rcl@{}} &&V_{1} = \{(i, j) \mid 1 \leq i \leq j \leq |{w}|\}, \qquad E_{1} = \{\{(i_{1},j_{1}), (i_{2},j_{2})\} \mid i_{1}\leq i_{2} \leq j_{1}\} ,\\ &&V_{2} = \mathsf{F}_{\geq 2}(w) , \qquad\qquad\qquad\qquad\quad E_{2} = \{\{w[i..j], (i, j)\} \mid 1 \leq i <j \leq |{w}|\} ,\\ &&V_{3} = \{(u,i) \mid u \in V_{2}, 0 \leq i \leq |u|\}, \!\!\quad E_{3} = \{\{u, (u,i)\} \mid u \in V_{2}, 0 \leq i \leq |u|\} . \end{array} $$

Intuitively speaking, the vertices of V₁ represent every factor by its start and end position, whereas V₂ contains exactly one vertex per factor of length at least 2. Every u ∈ V₂ is connected to (i, j) if and only if w[i..j] = u. Two distinct vertices (i, j), $(i^{\prime }, j^{\prime })$ are connected if they refer to overlapping factors. For every u ∈ V₂, there are |u| + 1 special vertices in V₃ that are only adjacent to u. Consequently, we can view Φ₁(w) as consisting of |w| layers, where the i^th layer contains the vertices (j, j + (i − 1)) ∈ V₁, 1 ≤ j ≤|w|− (i − 1), the vertices {u ∈ V₂∣|u| = i} and the vertices {(u, j) ∈ V₃∣|u| = i,0 ≤ j ≤|u|} (see Fig. 2 for an illustration).
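To make the definition concrete, the following minimal sketch builds Φ₁(w) explicitly for a given word. The vertex encoding is an assumption made for illustration only: vertices of V₁, V₂ and V₃ are tagged as `("pos", i, j)`, `("fac", u)` and `("pend", u, i)`, respectively, with 1-based positions as in the text.

```python
from itertools import combinations

def phi1(w):
    """Return (vertices, edges) of Phi_1(w); edges are frozensets of two vertices."""
    n = len(w)
    # V1: one vertex per occurrence (i, j) of a factor w[i..j]
    v1 = [("pos", i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
    # V2: one vertex per distinct factor of length at least 2
    factors = sorted({w[i:j] for i in range(n) for j in range(i + 2, n + 1)})
    v2 = [("fac", u) for u in factors]
    # V3: |u| + 1 pendant vertices for every factor u
    v3 = [("pend", u, i) for u in factors for i in range(len(u) + 1)]
    edges = set()
    # E1: two distinct occurrences are adjacent if they overlap
    for a, b in combinations(v1, 2):
        if max(a[1], b[1]) <= min(a[2], b[2]):
            edges.add(frozenset((a, b)))
    # E2: an occurrence of length >= 2 is adjacent to the factor it spells
    for (_, i, j) in v1:
        if i < j:
            edges.add(frozenset((("fac", w[i - 1:j]), ("pos", i, j))))
    # E3: every pendant vertex is adjacent to its factor only
    for (_, u, i) in v3:
        edges.add(frozenset((("fac", u), ("pend", u, i))))
    return v1 + v2 + v3, edges
```

Calling `phi1("abab")` gives 10 vertices in V₁, 5 in V₂ and 19 in V₃; the explicit construction is quadratic in |V₁| and is meant for experimentation on short words.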

Next, we show that 1-level grammars for w correspond to independent dominating sets for Φ₁(w). Intuitively speaking, the vertices in an independent dominating set from V₁ induce a factorisation of w, which, in turn, induces the axiom of a 1-level grammar in the natural way (i. e., every factor of size at least 2 is represented by a rule). If (i, j) ∈ V₁ is in the independent dominating set, then w[i..j] ∈ V₂ is not; thus, due to the domination-property, all (w[i..j], ℓ) ∈ V₃, 0 ≤ ℓ ≤ j − i + 1, are in the independent dominating set, which represents the size of the rule.

Lemma 9

Let w ∈Σ^∗, k ≥ 1. There exists an independent dominating set D of cardinality at most k for Φ₁(w) if and only if there exists a 1-level grammar G for w with |G|≤ k −|F_≥ 2(w)|.

Proof

We start with the if direction. If G = (N,Σ, R, ax) is a 1-level grammar for w with size k −|F_≥ 2(w)|, then we can construct an independent dominating set D for Φ₁(w) of size k as follows. Let ax = A₁A₂…A_n, A_i ∈ N ∪Σ, 1 ≤ i ≤ n, and let $F = \{\mathfrak {D}(A) \mid A \in N\}$. For every i, 1 ≤ i ≤ n, we add $(|\mathfrak {D}(A_{1} {\ldots } A_{i-1})| + 1, |\mathfrak {D}(A_{1} {\ldots } A_{i})|) \in V_{1}$ to D and, if A_i ∈ N, then we also add all $\{(\mathfrak {D}(A_{i}), j) \mid 0 \leq j \leq |\mathfrak {D}(A_{i})|\}$ to D. Furthermore, we add all V₂ ∖ F to D. It can be easily verified that D is an independent dominating set. Moreover, $|D| = |\mathsf {ax}| + {\sum }_{v \in F} (|v| + 1) + |V_{2} \setminus F| = |\mathsf {ax}| + {\sum }_{v \in F} |v| + |V_{2}| = |\mathsf {ax}| + {\sum }_{A \in N} |\mathfrak {D}(A)| + |V_{2}| = |G| + |\mathsf {F}_{\geq 2}(w)|$. Since |G| = k −|F_≥ 2(w)|, we conclude that |D| = k.

Next, we prove the only if direction. Let D be an independent dominating set for Φ₁(w). We first note that, for every u ∈ V₂ ∖ D, $\{(u, j) \mid 0 \leq j \leq |u|\} \subseteq D$, which implies that

$$ \begin{array}{@{}rcl@{}} |D| &=& |D \cap V_{1}| + |D \cap V_{2}| + |D \cap V_{3}|\\ &\geq& |D \cap V_{1}| + |D \cap V_{2}| + \sum\limits_{u \in (V_{2} \setminus D)} |\{(u, j) \mid 0 \leq j \leq |u|\}|\\ &=& |D \cap V_{1}| + |D \cap V_{2}| + \sum\limits_{u \in (V_{2} \setminus D)} (|u| + 1) \\ &=& |D \cap V_{1}| + |V_{2}| + \sum\limits_{u \in (V_{2} \setminus D)} |u| . \end{array} $$

For every i, 1 ≤ i ≤|w|, we say that i is covered by $(j, j^{\prime }) \in V_{1}$ if $(j, j^{\prime }) \in D$ and $j \leq i \leq j^{\prime }$ (recall that any vertex (i, i) can only be dominated by some vertex $(j, j^{\prime })$ with $j \leq i \leq j^{\prime }$, since vertex (i, i) has no neighbours in V₂). If some i, 1 ≤ i ≤|w|, is not covered by any $(j, j^{\prime }) \in V_{1}$, then (i, i) is not dominated by D and if i is covered by two different elements from V₁, then there is an edge (from E₁) between them, so that D is not an independent set. Thus, every i, 1 ≤ i ≤|w|, is covered by exactly one element $(j, j^{\prime }) \in V_{1}$. This directly implies that D ∩ V₁ = {(ℓ₁, r₁),(ℓ₂, r₂),…,(ℓ_m, r_m)}, such that (u₁, u₂,…, u_m) is a factorisation of w, where u_j = w[ℓ_j..r_j], 1 ≤ j ≤ m. Due to the edges in E₂, we know that, for every j, 1 ≤ j ≤ m, with ℓ_j < r_j, there is an edge (u_j,(ℓ_j, r_j)); thus, u_j ∈ (V₂ ∖ D). Next, we define N = {A_u∣u ∈ (V₂ ∖ D)} and R = {A_u → u∣u ∈ (V₂ ∖ D)}. Since now for each j, 1 ≤ j ≤ m, either u_j ∈Σ or there exists a non-terminal $A_{u_{j}}$ which derives u_j, we can define an axiom of length m by $\mathsf {ax} = C_{u_{1}} C_{u_{2}} {\ldots } C_{u_{m}}$ with $C_{u_{j}}=A_{u_{j}}$ for all j with |u_j| > 1 and $C_{u_{j}}=u_{j}$ otherwise, in order to obtain a 1-level grammar G = (N,Σ, R, ax) with $\mathfrak {D}(G) = w$. Finally, we note that

$$ \begin{array}{@{}rcl@{}} |G| &=& |\mathsf{ax}| + \sum\limits_{u \in (V_{2} \setminus D)} (|u|)\\ &=& |D \cap V_{1}| + |V_{2}| + \left( \sum\limits_{u \in (V_{2} \setminus D)} |u|\right) - |V_{2}| \\ &\leq& |D| - |\mathsf{F}_{\geq 2}(w)| . \end{array} $$

□
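The construction used in the if direction can be replayed on small examples. The sketch below is an illustration under stated assumptions rather than part of the proof: a 1-level grammar is given as a factorisation `pieces` of w in which every piece of length at least 2 is derived by its own nonterminal, and the vertex tags `"pos"`, `"fac"` and `"pend"` for V₁, V₂ and V₃ are hypothetical names.

```python
def factors2(w):
    """F_{>=2}(w): all distinct factors of w of length at least 2."""
    n = len(w)
    return {w[i:j] for i in range(n) for j in range(i + 2, n + 1)}

def vertices(w):
    n = len(w)
    v1 = [("pos", i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
    v2 = [("fac", u) for u in factors2(w)]
    v3 = [("pend", u, i) for (_, u) in v2 for i in range(len(u) + 1)]
    return v1 + v2 + v3

def adjacent(w, a, b):
    """Adjacency in Phi_1(w), tested in both argument orders."""
    if a == b:
        return False
    for x, y in ((a, b), (b, a)):
        if x[0] == "pos" and y[0] == "pos":        # E1: overlapping occurrences
            return max(x[1], y[1]) <= min(x[2], y[2])
        if x[0] == "fac" and y[0] == "pos":        # E2: occurrence spells factor
            return y[1] < y[2] and w[y[1] - 1:y[2]] == x[1]
        if x[0] == "fac" and y[0] == "pend":       # E3: pendant of this factor
            return y[1] == x[1]
    return False

def ids_from_grammar(w, pieces):
    """Build D as in the proof of Lemma 9; returns (D, |G|)."""
    assert "".join(pieces) == w
    F = {p for p in pieces if len(p) >= 2}         # derivatives of nonterminals
    D, start = set(), 1
    for p in pieces:                               # one V1 vertex per axiom symbol
        D.add(("pos", start, start + len(p) - 1))
        start += len(p)
    for u in F:                                    # all pendants of used factors
        D.update(("pend", u, i) for i in range(len(u) + 1))
    D.update(("fac", u) for u in factors2(w) - F)  # all unused factors
    return D, len(pieces) + sum(len(u) for u in F)

def is_independent_dominating(w, D):
    independent = all(not adjacent(w, a, b) for a in D for b in D)
    dominating = all(v in D or any(adjacent(w, v, d) for d in D) for v in vertices(w))
    return independent and dominating
```

For w = abab and the grammar with axiom AA and rule A → ab, i.e. `pieces = ["ab", "ab"]`, the resulting set has |G| + |F_≥ 2(w)| = 4 + 5 elements and passes `is_independent_dominating`.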

Since in the multi-level case the derivatives of the nonterminals that appear in the axiom are again compressed by a grammar, a first idea that comes to mind is to somehow represent the vertices u ∈ V₂ again by graph structures of the type Φ₁(u) and iterating this step. However, naively carrying out this idea would lead to redundancies (copies of the subgraph representing a factor u would appear inside subgraphs representing different superstrings w₁uw₂ and $w^{\prime }_{1} u w^{\prime }_{2}$) that even seem to cause an exponential size increase of the graph structure. Fortunately, it turns out that these redundancies can be avoided and a surprisingly simple modification of Φ₁(w) is sufficient.

For a word w ∈Σ^∗, let Φ_m(w) = (V, E) be defined as follows. Let V = V₁ ∪ V₂ ∪ V₃ ∪ V₄, where V₁ and V₂ are defined as for Φ₁(w), whereas

$$ \begin{array}{@{}rcl@{}} V_{3} &=& \{(u, 0) \mid u \in V_{2}\} \text{ and}\\ V_{4} &=& \bigcup\limits_{u \in V_{2}} V_{4, u} \text{ with } V_{4, u} = \{(u, i, j) \mid 1 \leq i \leq j \leq |u|, u[i..j]\neq u\} \text{ for } u \in V_{2} . \end{array} $$

Moreover, E = E₁ ∪ E₂ ∪ E₃ ∪ E₄ ∪ E₅, where E₁ and E₂ are defined as for Φ₁(w), while

$$ \begin{array}{@{}rcl@{}} E_{3} &= &\{\{u, (u, 0)\} \mid u \in V_{2}\} \cup \{\{u, (u,i,j)\} \mid u \in V_{2}, (u,i,j) \in V_{4, u}\} ,\\ E_{4} &= &\bigcup\limits_{u \in V_{2}} E_{4, u}, \text{ with } E_{4, u} = \{\{(u, i_{1},j_{1}), (u, i_{2},j_{2})\}\subseteq V_{4,u} \mid i_{1} \leq i_{2} \leq j_{1}\},\\ &&\text{ for every } u \in V_{2}, \text{ and}\\ E_{5} &= &\{\{u, (v, i, j)\} \mid u, v \in V_{2}, v[i..j] = u,u\neq v\} . \end{array} $$

Intuitively speaking, Φ_m(w) differs from Φ₁(w) in the following way. We add to every vertex u ∈ V₂ a subgraph (V_{4, u}, E_{4, u}), which is completely connected to u and which represents u in the same way as the subgraph (V₁, E₁) of Φ₁(w) represents w, i. e., factors u[i..j] are represented by (u, i, j) and edges represent overlappings. Moreover, if a u ∈ V₂ is a factor of some v ∈ V₂, then there is an edge from u to all the vertices (v, i, j) ∈ V_{4, v} that satisfy v[i..j] = u (by these “crosslinks”, we get rid of the redundancies mentioned above). Finally, every u ∈ V₂ is also connected with an otherwise isolated vertex (u,0) ∈ V₃. See Fig. 3 for a partial illustration of a Φ_m(w).
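As for the 1-level case, the definition can be turned into a short construction. The following sketch is illustrative only; the tags `("pos", i, j)`, `("fac", u)`, `("zero", u)` and `("sub", u, i, j)` for the vertices of V₁, V₂, V₃ and V₄ are assumptions made here and do not appear in the definitions above.

```python
from itertools import combinations

def phi_m(w):
    """Return (vertices, edges) of Phi_m(w); edges are frozensets of two vertices."""
    n = len(w)
    v1 = [("pos", i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
    factors = sorted({w[i:j] for i in range(n) for j in range(i + 2, n + 1)})
    # V_{4,u}: all proper sub-intervals of u (the full interval spells u itself)
    v4 = {u: [("sub", u, i, j)
              for i in range(1, len(u) + 1) for j in range(i, len(u) + 1)
              if (i, j) != (1, len(u))]
          for u in factors}

    def overlap(a, b):
        # the last two components of a "pos" or "sub" vertex are its interval
        return max(a[-2], b[-2]) <= min(a[-1], b[-1])

    edges = set()
    for a, b in combinations(v1, 2):               # E1
        if overlap(a, b):
            edges.add(frozenset((a, b)))
    for (_, i, j) in v1:                           # E2
        if i < j:
            edges.add(frozenset((("fac", w[i - 1:j]), ("pos", i, j))))
    for u in factors:
        edges.add(frozenset((("fac", u), ("zero", u))))    # E3, vertex (u, 0)
        for s in v4[u]:
            edges.add(frozenset((("fac", u), s)))          # E3, u with V_{4,u}
            sub = u[s[2] - 1:s[3]]
            if len(sub) >= 2:                              # E5: crosslink
                edges.add(frozenset((("fac", sub), s)))
        for a, b in combinations(v4[u], 2):        # E4
            if overlap(a, b):
                edges.add(frozenset((a, b)))
    verts = (v1 + [("fac", u) for u in factors] + [("zero", u) for u in factors]
             + [s for u in factors for s in v4[u]])
    return verts, edges
```

For w = aba, the only crosslinks are the two edges joining ab and ba to the corresponding sub-intervals of aba.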

Similarly to the 1-level case, we can show that (multi-level) grammars for w correspond to independent dominating sets for Φ_m(w):

Lemma 10

Let w ∈Σ^∗, k ≥ 1. There is an independent dominating set D of cardinality k for Φ_m(w) if and only if there is a grammar G for w with |G| = k −|F_≥ 2(w)|.

Proof

Let D be an independent dominating set of cardinality k for Φ_m(w). In the same way as in the proof of Lemma 9, it can be concluded that the set $V_{1} \cap D = \{(\ell _{1}, r_{1}), (\ell _{2}, r_{2}), \ldots , (\ell _{m_{w}}, r_{m_{w}})\}$ corresponds to a factorisation $(w_{1}, w_{2}, \ldots , w_{m_{w}})$ of w, where w_j = w[ℓ_j..r_j], 1 ≤ j ≤ m_w, and satisfies $\{w_{1}, w_{2}, \ldots , w_{m_{w}}\} \cap D = \emptyset $.

Next, for an arbitrary u ∈ V₂, we consider the subgraph with the vertices N[u] ∖ V₁ = V_{4, u} ∪{(v, i, j)∣v[i..j] = u, u≠v}∪{u,(u,0)}. If u ∈ D, then N(u) ∩ D = ∅. On the other hand, if u∉D, then (u,0) ∈ D and, analogously as for V₁, we can conclude that

$$ V_{4, u} \cap D = \{(u, \ell_{u,1}, r_{u,1}), (u, \ell_{u,2}, r_{u,2}), \ldots, (u, \ell_{u,m_{u}}, r_{u,m_{u}})\} , $$

such that $(u_{1}, u_{2}, \ldots , u_{m_{u}})$ is a factorisation of u (note that, in the same way as for V₁, if a position i of u is not covered in the sense that $(u, j, j^{\prime }) \in D$ with $j \leq i \leq j^{\prime }$, then vertex (u, i, i) would neither be in D nor adjacent to a vertex in D), where u_j = u[ℓ_{u, j}..r_{u, j}], 1 ≤ j ≤ m_u. Furthermore, for every j, 1 ≤ j ≤ m_u, with |u_j|≥ 2, {u_j,(u, ℓ_{u, j}, r_{u, j})}∈ E; thus, u_j∉D. Consequently, by induction, D induces a factorisation $(u_{1}, u_{2}, \ldots , u_{m_{u}})$ for every u ∈ (V₂ ∖ D) ∪{w}, such that, for every j, 1 ≤ j ≤ m_u, |u_j|≥ 2 implies u_j ∈ V₂ ∖ D, which means that there is also a factorisation for u_j.

For every u ∈ V₂ ∖ D, we can now define a nonterminal A_u and a rule $A_{u} \to B_{1} B_{2} {\ldots } B_{m_{u}}$, where, for every j, 1 ≤ j ≤ m_u, $B_{j} = A_{u_{j}}$ if |u_j|≥ 2 and B_j = u_j if |u_j| = 1. Obviously, these rules together with the axiom $\mathsf {ax} = C_{1} C_{2} {\ldots } C_{m_{w}}$, where, for every j, 1 ≤ j ≤ m_w, $C_{j} = A_{w_{j}}$ if |w_j|≥ 2 and C_j = w_j if |w_j| = 1, define a grammar G for w.

We note that |ax| = |V₁ ∩ D| and, for every rule A_u → α_u, |α_u| = |V_{4, u} ∩ D|. Since

$$ \begin{array}{@{}rcl@{}} |D| &= &|D \cap V_{1}| + |(D \cap (\bigcup\limits_{u \in V_{2}} V_{4, u}))| + |D \cap (V_{2} \cup V_{3})| ,\\ |V_{2}| &= &|D \cap (V_{2} \cup V_{3})| \text{ and}\\ |G| &= &|D \cap V_{1}| + |(D \cap (\bigcup\limits_{u \in V_{2}} V_{4, u}))| , \end{array} $$

we conclude that |G| = |D|−|V₂| = k −|F_≥ 2(w)|.

For a grammar G for w, we can select vertices from Φ_m(w) according to the factorisations induced by the rules of G, which results in an independent dominating set D for Φ_m(w) with |D| = |G| + |V₂|. □

For the algorithmic application of these graph encodings, it is important to note that the proofs of Lemmas 9 and 10 are constructive, i. e., they also show how an independent dominating set D of Φ_m(w) or Φ₁(w) can be transformed into a grammar for w (a 1-level grammar for w, respectively) of size |D|−|F_≥ 2(w)|, which, in the following, we will denote by G(D).

Thus, the smallest grammar problem can be solved by constructing Φ_m(w) or Φ₁(w), then computing a smallest independent dominating set D for Φ_m(w) (or Φ₁(w), respectively) and finally constructing G(D). Unfortunately, this does not lead to a polynomial-time algorithm, since computing a smallest independent dominating set is NP-hard, even for quite restricted graph classes [58, Theorem 13].

In the following, we shall analyse the graph structures Φ_m(w) and Φ₁(w) more thoroughly and we begin with their respective sizes:

Proposition 3

Let w ∈Σ^∗. Then Φ₁(w) has $\mathcal {O}(|{w}|^{3})$ vertices and $\mathcal {O}(|{w}|^{4})$ edges; Φ_m(w) has $\mathcal {O}(|{w}|^{4})$ vertices and $\mathcal {O}(|{w}|^{6})$ edges.

Proof

We first consider Φ_m(w). The subgraph (V₁, E₁) has $\mathcal {O}(|{w}|^{2})$ vertices and $\mathcal {O}(|{w}|^{4})$ edges. Similarly, every induced subgraph on the set of vertices V_{4, u} ∪{u,(u,0)}, u ∈ V₂ has $\mathcal {O}(|{w}|^{2})$ vertices, $\mathcal {O}(|{w}|^{4})$ edges and there are $\mathcal {O}(|{w}|^{2})$ such subgraphs. In addition to this, there are $\mathcal {O}(|{w}|)$ edges connecting any u ∈ V₂ with vertices from V₁ and $\mathcal {O}(|{w}|^{2})$ edges connecting any u ∈ V₂ with vertices from V₄. Finally, there are $\mathcal {O}(|{w}|^{2})$ vertices in V₃ with one incident edge each. Consequently, Φ_m(w) has $\mathcal {O}(|{w}|^{4})$ vertices and $\mathcal {O}(|{w}|^{6})$ edges.

For Φ₁(w), the situation is easier. The subgraph (V₁, E₁) has $\mathcal {O}(|{w}|^{2})$ vertices and $\mathcal {O}(|{w}|^{4})$ edges. There are $\mathcal {O}(|{w}|^{2})$ vertices in V₂ and each u ∈ V₂ has $\mathcal {O}(|{w}|)$ edges. Finally, since every u ∈ V₂ contributes |u| + 1 vertices to V₃, there are $\mathcal {O}(|{w}|^{3})$ vertices in V₃ with one edge each. Consequently, Φ₁(w) has $\mathcal {O}(|{w}|^{3})$ vertices and $\mathcal {O}(|{w}|^{4})$ edges. □

Next, we investigate the interval-structure of Φ_m(w) and Φ₁(w).

Proposition 4

Φ_m(w) and Φ₁(w) are 2-interval graphs.

Proof

In the following 2-interval representations, we denote by I₁(v) the first and by I₂(v) the second interval that represents a vertex v.

We first consider the graph Φ₁(w). For every (i, j) ∈ V₁, we set I₁((i, j)) = [i, j]; this already yields the subgraph (V₁, E₁). In addition, let I₁(u), u ∈ V₂, be a sequence of pairwise disjoint intervals that are also disjoint with the intervals I₁((i, j)), (i, j) ∈ V₁. For every (u, j) ∈ V₃, let I₁((u, j)) be an interval that lies within I₁(u) and is disjoint from every other interval. Now, it only remains to represent the edges from E₂, for which we simply let I₂((i, j)), (i, j) ∈ V₁, be an interval that lies within I₁(w[i..j]) and is disjoint from every other interval. Note that only the vertices from V₁ are represented by two intervals each.

For Φ_m(w), we represent V₁ ∪ V₂ and the edges E₁ ∪ E₂ by intervals in the same way as for the graph Φ₁(w). Then, for every u ∈ V₂ and (u, i, j) ∈ V_{4, u}, we set I₁((u, i, j)) = [i + k_u, j + k_u], where k_u is chosen such that all these intervals lie inside I₁(u) without intersecting an interval I₂((i, j)) for some (i, j) ∈ V₁. In particular, this takes care of all the edges E_{4, u} (due to the intersections between these intervals) and the edges between u and the vertices V_{4, u} (due to the fact that these intervals lie inside I₁(u)). In order to take care of the edges from E₅, for every u and for every (v, i, j) ∈ V_{4, v} with v[i..j] = u, we place a new interval I₂((v, i, j)) inside of I₁(u) such that it does not intersect with any other interval inside of I₁(u). This creates all the edges from E₅. Now it only remains to take care of vertices (u,0), u ∈ V₂, and their edges, which can be done by placing a new interval I₁((u,0)) inside I₁(u) such that it does not intersect with any other interval. □
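The representation for Φ₁(w) can be checked mechanically on short words. The following sketch is an illustration under assumptions of our own: it uses closed integer intervals, realises the small intervals placed inside I₁(u) as single points, and reuses hypothetical vertex tags `"pos"`, `"fac"` and `"pend"` for V₁, V₂ and V₃. It builds the intervals and compares their intersection graph with the edge set given by the definition.

```python
from itertools import combinations

def phi1_edges(w):
    """V1, the sorted factor list, and the edge set of Phi_1(w) by definition."""
    n = len(w)
    v1 = [("pos", i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
    factors = sorted({w[i:j] for i in range(n) for j in range(i + 2, n + 1)})
    edges = {frozenset((a, b)) for a, b in combinations(v1, 2)
             if max(a[1], b[1]) <= min(a[2], b[2])}
    edges |= {frozenset((("fac", w[i - 1:j]), ("pos", i, j)))
              for (_, i, j) in v1 if i < j}
    edges |= {frozenset((("fac", u), ("pend", u, i)))
              for u in factors for i in range(len(u) + 1)}
    return v1, factors, edges

def two_interval_representation(w):
    """Map every vertex of Phi_1(w) to a list of one or two closed intervals."""
    v1, factors, _ = phi1_edges(w)
    rep = {v: [(v[1], v[2])] for v in v1}          # I_1((i, j)) = [i, j]
    cursor = len(w) + 1                            # right of all V1 intervals
    for u in factors:
        occurrences = [v for v in v1 if v[1] < v[2] and w[v[1] - 1:v[2]] == u]
        slots = len(u) + 1 + len(occurrences)
        rep[("fac", u)] = [(cursor, cursor + slots - 1)]   # I_1(u)
        for i in range(len(u) + 1):                # I_1((u, i)) inside I_1(u)
            rep[("pend", u, i)] = [(cursor + i, cursor + i)]
        for k, v in enumerate(occurrences):        # I_2((i, j)) inside I_1(u)
            point = cursor + len(u) + 1 + k
            rep[v].append((point, point))
        cursor += slots                            # next factor starts after I_1(u)
    return rep

def intersection_graph(rep):
    """Edges between vertices that have at least one pair of meeting intervals."""
    def meet(a, b):
        return any(max(x[0], y[0]) <= min(x[1], y[1])
                   for x in rep[a] for y in rep[b])
    return {frozenset((a, b)) for a, b in combinations(rep, 2) if meet(a, b)}
```

As in the proof, only the vertices of V₁ that refer to factors of length at least 2 receive a second interval.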

Unfortunately, the independent dominating set problem for 2-interval graphs is still NP-complete (in [58], the hardness of the independent dominating set problem for subcubic graphs is shown and from [59], it follows that subcubic graphs are 2-interval graphs). Nevertheless, solving the smallest grammar problem by computing small independent dominating sets for Φ_m(w) or Φ₁(w), as sketched before Proposition 3, might still be worthwhile, since computing small independent dominating sets is a well-researched problem, for which the literature provides fast and sophisticated algorithms (see [60, 61]). In particular, the 2-interval structure suggests that we are dealing with simpler instances of the independent dominating set problem.

Our algorithmic application of the graph encodings, which leads to the polynomial-time solvability of the smallest grammar problem with a bounded number of nonterminals, can be sketched as follows. If we have fixed the set of factors $F \subseteq \mathsf {F}_{\geq 2}(w)$ to occur as derivatives of nonterminals in the grammar, i. e., $\{\mathfrak {D}(A) \mid A \in N\} = F$, then, for the corresponding independent dominating set D of Φ_m(w) or Φ₁(w), we must have $(\mathsf {F}_{\geq 2}(w) \setminus F) \subseteq D$ and F ∩ D = ∅. Thus, in order to find an independent dominating set that is smallest among all those that correspond to a grammar with $\{\mathfrak {D}(A) \mid A \in N\} = F$, it is sufficient to first select the vertices (F_≥ 2(w) ∖ F), then delete the closed neighbourhood of this vertex set together with F, and finally compute a smallest independent dominating set for what remains, which is the graph ${\mathscr{H}} = {\Phi }(w) \setminus (N[\mathsf {F}_{\geq 2}(w) \setminus F] \cup F)$.^{Footnote 10} Crucially, ${\mathscr{H}}$ is an interval graph, so a smallest independent dominating set for it can be computed in linear time.

In order to carry out this approach, we first formally prove that ${\mathscr{H}}$ is an interval graph:

Proposition 5

Let w ∈Σ⁺, $F \subseteq \mathsf {F}_{\geq 2}(w)$ and Φ(w) ∈{Φ_m(w),Φ₁(w)}. Then ${\mathscr{H}} = {\Phi }(w) \setminus (N[\mathsf {F}_{\geq 2}(w) \setminus F] \cup F)$ is an interval graph.

Proof

We only prove the case Φ(w) = Φ_m(w), since the case Φ(w) = Φ₁(w) can be handled analogously. First, we consider the 2-interval representation of Φ_m(w) described in the proof of Proposition 4. We can now obtain a 1-interval representation of ${\mathscr{H}}$ from the 2-interval representation of Φ_m(w) as follows. Since ${\mathscr{H}}$ does not contain any vertex from V₂, we first remove the corresponding intervals for vertices from V₂. The only vertices represented by more than one interval are the ones from V₁ and V₄. However, the second intervals of these only intersect intervals which represent vertices from V₂ in the 2-interval representation of Φ_m(w), which means that they are now all isolated and can therefore be removed. Consequently, every vertex of ${\mathscr{H}}$ can be represented by one interval.□

Next, we show that independent dominating sets for ${\mathscr{H}}$ can be easily extended to independent dominating sets for Φ_m(w) (or Φ₁(w)).

Proposition 6

Let w ∈Σ⁺, $F \subseteq \mathsf {F}_{\geq 2}(w)$, Φ(w) ∈{Φ_m(w),Φ₁(w)} and let $D_{{\mathscr{H}}}$ be an independent dominating set for ${\mathscr{H}} = {\Phi }(w) \setminus (N[\mathsf {F}_{\geq 2}(w) \setminus F] \cup F)$. Then $D_{{\mathscr{H}}} \cup (\mathsf {F}_{\geq 2}(w) \backslash F)$ is an independent dominating set for Φ(w).

Proof

We start with the multi-level case. Since $D_{{\mathscr{H}}}$ is an independent dominating set for ${\mathscr{H}}$, it is also an independent set for Φ_m(w). The only vertices of Φ_m(w) that are not necessarily dominated by $D_{{\mathscr{H}}}$ are from N[F_≥ 2(w) ∖ F] or F. Since $\mathsf {F}_{\geq 2}(w) \setminus F \subseteq D_{{\mathscr{H}}} \cup (\mathsf {F}_{\geq 2}(w) \backslash F)$, the vertices from N[F_≥ 2(w) ∖ F] are dominated by $D_{{\mathscr{H}}} \cup (\mathsf {F}_{\geq 2}(w) \backslash F)$. Regarding the vertices from F, we note that since $F \cap D_{{\mathscr{H}}} = \emptyset $, the vertices {(u,0)∣u ∈ F} occur in ${\mathscr{H}}$ as isolated vertices and, thus, they must be included in $D_{{\mathscr{H}}}$, which means that the vertices F are dominated in Φ_m(w) by $D_{{\mathscr{H}}} \cup (\mathsf {F}_{\geq 2}(w) \backslash F)$ as well. Now it only remains to observe that, by definition of Φ_m(w), the vertices (F_≥ 2(w)∖F) are clearly independent and, since their neighbourhood is completely excluded from ${\mathscr{H}}$ and therefore also from $D_{{\mathscr{H}}}$, they are also independent from the vertices in $D_{{\mathscr{H}}}$. Consequently, $D_{{\mathscr{H}}} \cup (\mathsf {F}_{\geq 2}(w) \backslash F)$ is an independent dominating set for Φ_m(w).

The argument for the 1-level case is very similar with the only difference that {(u, i)∣u ∈ F,0 ≤ i ≤|u|} are the vertices from $D_{{\mathscr{H}}} \cup (\mathsf {F}_{\geq 2}(w) \backslash F)$ that dominate the vertices F. □

For the sake of convenience, for any $F \subseteq \mathsf {F}_{\geq 2}(w)$, we call a grammar G = (N,Σ, R, ax) for w with $\{\mathfrak {D}(A) \mid A \in N\} = F$ an F-grammar; a smallest F-grammar for w is one of minimal size among all F-grammars for w.

Lemma 11

Let w ∈Σ⁺ and $F \subseteq \mathsf {F}_{\geq 2}(w)$. A smallest F-grammar for w can be computed in time $\mathcal {O}(|{w}|^{6})$ and a smallest 1-level F-grammar for w can be computed in time $\mathcal {O}(|{w}|^{4})$.

Proof

Again, we only prove the multi-level case, since the 1-level case can be dealt with analogously. We compute a smallest F-grammar for w as follows. First, we construct Φ_m(w) and then ${\mathscr{H}} = {\Phi }_{m}(w) \setminus (N[\mathsf {F}_{\geq 2}(w) \setminus F] \cup F)$, which can be done in time $\mathcal {O}(|{\Phi }_{m}(w)|) = \mathcal {O}(|{w}|^{6})$ (see Proposition 3). Obviously, we could also construct ${\mathscr{H}}$ directly, which would not change the overall running-time. Next, we compute a minimal independent dominating set $D_{{\mathscr{H}}}$ for ${\mathscr{H}}$, which, since ${\mathscr{H}}$ is an interval graph (see Proposition 5), can be done in time $\mathcal {O}(|{\mathscr{H}}|) = \mathcal {O}(|{w}|^{6})$ (see Section 2.1). Finally, we construct $G = \mathsf {G}(D_{{\mathscr{H}}} \cup (\mathsf {F}_{\geq 2}(w) \setminus F))$ (note that, by Proposition 6, $D_{{\mathscr{H}}} \cup (\mathsf {F}_{\geq 2}(w) \setminus F)$ is an independent dominating set for Φ_m(w); thus, G is well-defined), which can be done in time $\mathcal {O}(|{w}|^{6})$ as well.

It remains to prove that G is a smallest F-grammar. To this end, we assume that there exists an F-grammar $G^{\prime }$ for w and $|G^{\prime }| < |G|$. Consequently, by Lemma 10, there is an independent dominating set $D^{\prime }$ for Φ_m(w) with $|G^{\prime }| = |D^{\prime }| - |\mathsf {F}_{\geq 2}(w)|$. Since both G and $G^{\prime }$ are F-grammars, $\mathsf {F}_{\geq 2}(w) \setminus D =\mathsf {F}_{\geq 2}(w) \setminus D^{\prime }=F$. This implies that $D^{\prime }_{{\mathscr{H}}} = D^{\prime } \setminus (\mathsf {F}_{\geq 2}(w) \setminus F)$ is an independent dominating set for ${\mathscr{H}}$. Since by Lemma 10, |G| = |D|−|F_≥ 2(w)| and, by assumption, $|D^{\prime }| < |D|$, it follows that $|D^{\prime }_{{\mathscr{H}}}| < |D_{{\mathscr{H}}}|$, which is a contradiction to the minimality of $D_{{\mathscr{H}}}$. Consequently, G is a smallest F-grammar for w. □

If instead of a set F of factors, we are only given an upper bound k on |N|, then we can compute a smallest grammar by enumerating all $F \subseteq \mathsf {F}_{\geq 2}(w)$ with |F|≤ k and computing a smallest F-grammar. This shows that smallest grammars can be computed in polynomial time if the number of nonterminals is bounded.

Theorem 10

Let w ∈Σ^∗ and $k \in \mathbb {N}$. A grammar (1-level grammar, resp.) for w with at most k rules that is smallest among all grammars (1-level grammars, resp.) for w with at most k rules can be computed in time $\mathcal {O}(|{w}|^{2k+6})$ ($\mathcal {O}(|{w}|^{2k+4})$, resp.).

Proof

Obviously, a grammar G for w with k rules and

$$ |G| = \min\{|G^{\prime}| \mid G^{\prime} \text{ is smallest \textit{F}-grammar, with } F \subseteq \mathsf{F}_{\geq 2}(w), |F| \leq k\} $$

is smallest among all grammars for w with at most k rules. In order to compute such a grammar, it is sufficient to compute, for every set $F \subseteq \mathsf {F}_{\geq 2}(w)$ with |F|≤ k, a smallest F-grammar, which requires time $\mathcal {O}(|{w}|^{2k} \cdot |{w}|^{6}) = \mathcal {O}(|{w}|^{2k+6})$.

Analogously, we can compute a 1-level grammar for w with at most k rules that is smallest among all 1-level grammars for w with at most k rules in time $\mathcal {O}(|{w}|^{2k+4})$. □
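The enumeration step in this proof can be illustrated by a short sketch. It assumes that $\mathsf {F}_{\geq 2}(w)$ is the set of all distinct factors of w of length at least 2 (the definition is given earlier in the paper and is not repeated here); the function names `factors_ge2` and `candidate_sets` are hypothetical. Computing a smallest F-grammar for each candidate set is not shown.

```python
from itertools import combinations


def factors_ge2(w):
    """Distinct factors (substrings) of w with length >= 2 (assumed reading of F_{>=2}(w))."""
    n = len(w)
    return {w[i:j] for i in range(n) for j in range(i + 2, n + 1)}


def candidate_sets(w, k):
    """Yield every set F of at most k factors of length >= 2."""
    pool = sorted(factors_ge2(w))
    for size in range(k + 1):
        for combo in combinations(pool, size):
            yield frozenset(combo)


if __name__ == "__main__":
    w, k = "abab", 2
    print(len(factors_ge2(w)), sum(1 for _ in candidate_sets(w, k)))
```

Since w has fewer than |w|² factors, the number of candidate sets is in $\mathcal {O}(|{w}|^{2k})$, which is the factor that appears in the running time above.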

This result raises some related questions, which shall be discussed next.

4.1 Related Questions

In the literature on grammar-based compression, the size of a smallest grammar has been interpreted in terms of a computable upper bound of the Kolmogorov complexity and, thus, as some measure for entropy or information content of strings (see Section 1). Similarly, we could treat the minimal number of nonterminals (i. e., number of rules) that are needed for a smallest grammar as a general parameter of strings, which we call the rule-number. The main motivation for doing this is pointed out by Theorem 10, which shows that a smallest grammar for w can be computed in time that is exponential only in the rule-number of w (or, in parameterised complexity terms, the smallest grammar problem parameterised by |N| is in XP). However, in order to apply the algorithm of Theorem 10 in this regard, we need to know the rule-number, which naturally leads to the question whether the rule-number of a given string can efficiently be computed. Unfortunately, the hardness reductions for the rule-size variants of the smallest grammar problem (see Section 3.3) have already provided a negative answer to this question (see Theorems 8 and 9).

The XP-membership of the smallest grammar problem, provided by Theorem 10, shows that the parameter |N| has a stronger impact on the complexity than |Σ| and, furthermore, it gives reason to hope that bounding |N| might also lead to practically relevant algorithms. In this regard, the algorithm of Theorem 10 with its running-time of the form $|{w}|^{\mathcal {O}(|N|)}$ is a bit disappointing, since it cannot be considered practical for larger constant bounds on |N|. On the other hand, an algorithm with a running-time of f(|N|) ⋅ g(|w|), for a polynomial g, would be a huge improvement. In other words, the question is whether the smallest grammar problem is also fixed-parameter tractable with respect to the number of nonterminals. Unfortunately, this seems unlikely, since, as stated by the next result, these parameterisations of 1-SGP and SGP are W[1]-hard. To prove this, we devise a parameterised reduction from the independent set problem parameterised by the size of the independent set, which is known to be W[1]-hard (see [62]).

Let $\mathcal {G} = (V, E)$ be a graph with V = {v₁, v₂,…, v_n}, |E| = m, and let $k \in \mathbb {N}$. We define the alphabet ${\Sigma } = V \cup \{\#\} \cup \{\diamond _{i} \mid 1 \leq i \leq m + {\sum }^{n}_{i = 1} (n - |N(v_{i})|)\}$ and the following word over Σ:

$$ w = \prod\limits_{\{v_{i}, v_{j}\} \in E}(\#v_{i}\#v_{j}\#\diamond) \prod\limits^{n}_{i = 1}(\#v_{i}\#\diamond)^{n-|N(v_{i})|} . $$

As already done in Section 3, every occurrence of ◇ in the word stands for a distinct symbol of $\{\diamond _{i} \mid 1 \leq i \leq m + {\sum }^{n}_{i = 1} (n - |N(v_{i})|)\}$. Note that, since ${\sum }^{n}_{i = 1} |N(v_{i})| = 2m$, we have |w| = 6m + 4(n² − 2m) = 4n² − 2m.
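The construction of w can be sketched as follows. The encoding of symbols (the string `"#"`, pairs `("v", i)` for vertices and `("d", t)` for the distinct ◇-symbols) and the name `reduction_word` are choices made for this illustration only; vertices are numbered from 0.

```python
def reduction_word(n, edges):
    """Build w as a list of symbols for a graph with vertices 0..n-1.

    '#' is the separator, ('v', i) is the symbol of vertex i, and
    ('d', t) is the t-th distinct diamond symbol.
    """
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    w, fresh = [], 0
    # One block #v_i#v_j#<> per edge.
    for i, j in edges:
        w += ["#", ("v", i), "#", ("v", j), "#", ("d", fresh)]
        fresh += 1
    # n - |N(v_i)| blocks #v_i#<> per vertex.
    for i in range(n):
        for _ in range(n - degree[i]):
            w += ["#", ("v", i), "#", ("d", fresh)]
            fresh += 1
    return w


if __name__ == "__main__":
    n, edges = 3, [(0, 1), (1, 2)]          # a path on three vertices
    w = reduction_word(n, edges)
    print(len(w), 4 * n * n - 2 * len(edges))
```

For the path on three vertices, both printed numbers agree with the length formula 4n² − 2m.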

Lemma 12

The following statements are equivalent for each k ≤ n:

$\mathcal {G}$ has an independent set I with |I| = k.
There is a grammar G for w with at most k nonterminals and |G|≤ 4n² − 2m + 3k − 2kn.
There is a 1-level grammar G for w with at most k nonterminals and |G|≤ 4n² − 2m + 3k − 2kn.

Proof

We first prove the equivalence of the first and the third statement. Let I be an independent set for $\mathcal {G}$ with |I| = k. We define a grammar G = (N,Σ, R, ax) by N = {A_i∣v_i ∈ I}, R = {A_i → #v_i#∣A_i ∈ N} and $\mathsf {ax} = w^{\prime }$, where $w^{\prime }$ is obtained from w, by replacing, for every v_i ∈ I, all occurrences of #v_i# by A_i (note that since I is an independent set, no two occurrences of factors #v_i# and #v_j# with v_i, v_j ∈ I overlap). Obviously, G is a 1-level grammar for w with k nonterminals. For every v_i ∈ I, $|\mathsf {ax}|_{A_{i}} = |N(v_{i})| + (n - |N(v_{i})|) = n$; thus, p(A_i) = 2n − 3 (recall that the concept of the profit p(A) of a nonterminal A of a 1-level grammar is defined on page 11). Consequently, $|G| = |{w}| - {\sum }_{A \in N} \mathsf {p}(A) = 4n^{2}-2m - k(2n - 3)$.

Let G = (N,Σ, R, ax) be a 1-level grammar of size at most 4n² − 2m − 2kn + 3k, with at most k nonterminals. We note that, for every A ∈ N, p(A) ≤ 2n − 3, since in w every repeated factor has size of at most 3 and is repeated at most n times. Since, by assumption, |G|≤ 4n² − 2m − k(2n − 3) and $|G| = 4n^{2}-2m - {\sum }_{A \in N} \mathsf {p}(A)$, we conclude that ${\sum }_{A \in N} \mathsf {p}(A) \geq k(2n - 3)$. Hence, there are exactly k nonterminals A ∈ N each with a right side of length 3, which implies A → #v_i#, for some i, 1 ≤ i ≤ n, and, furthermore, |ax|_A = n. It can be easily verified that this is only possible if {v_i∣there is (A → #v_i#) ∈ R} is an independent set for $\mathcal {G}$.

The third statement obviously implies the second statement. We assume that the second statement holds, i. e., there is a grammar G = (N,Σ, R, ax) for w with at most k nonterminals and |G|≤ 4n² − 2m + 3k − 2kn. If G is not a 1-level grammar, then it has a rule A → α with α∉Σ⁺ and, since the only repeated factors of w with a length of at least 3 have the form #x#, for some $x \in \{v_{1},\dots , v_{n}\}$, we also know that $\mathfrak {D}(A) = \# x \#$. In particular, this implies that α = B# or α = #B with B → #x ∈ R or B → x# ∈ R. Generally, each rule in G has a length (and hence cost) of at least 2, compresses a factor of length at most 3 and occurs in the axiom at most n times. The rules A and B together can occur at most n times in ax, as they both derive the symbol x. This means that the axiom has a length of at least |w|− (k − 1)2n and therefore the overall grammar has size of at least |ax| + 2k = 4n² − 2m − 2kn + 2n + 2k. Since we assumed that |G|≤ 4n² − 2m + 3k − 2kn, this implies 4n² − 2m − 2kn + 2n + 2k ≤ 4n² − 2m + 3k − 2kn, so 2n ≤ k which contradicts the assumption k ≤ n. □
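The direction from an independent set to a small 1-level grammar can be checked on a small instance. The following sketch rebuilds w with the same illustrative symbol encoding as assumed above (`"#"`, `("v", i)`, `("d", t)`), replaces every occurrence of #v_i# with v_i ∈ I by a nonterminal `("A", i)`, and compares the resulting size with 4n² − 2m + 3k − 2kn. All function names are hypothetical.

```python
def reduction_word(n, edges):
    """Word w of the reduction, as a list of symbols (vertices are 0..n-1)."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    w, fresh = [], 0
    for i, j in edges:
        w += ["#", ("v", i), "#", ("v", j), "#", ("d", fresh)]
        fresh += 1
    for i in range(n):
        for _ in range(n - degree[i]):
            w += ["#", ("v", i), "#", ("d", fresh)]
            fresh += 1
    return w


def grammar_from_independent_set(n, edges, ind_set):
    """Return (axiom, rules) of the 1-level grammar with rules A_i -> # v_i #."""
    w = reduction_word(n, edges)
    chosen = {("v", i) for i in ind_set}
    axiom, p = [], 0
    while p < len(w):
        # Left-to-right scan; occurrences cannot overlap if ind_set is independent.
        if p + 2 < len(w) and w[p] == "#" and w[p + 1] in chosen and w[p + 2] == "#":
            axiom.append(("A", w[p + 1][1]))
            p += 3
        else:
            axiom.append(w[p])
            p += 1
    rules = {("A", i): ["#", ("v", i), "#"] for i in ind_set}
    return axiom, rules


def grammar_size(axiom, rules):
    return len(axiom) + sum(len(rhs) for rhs in rules.values())


def expand(axiom, rules):
    out = []
    for s in axiom:
        out += rules.get(s, [s])
    return out


if __name__ == "__main__":
    n, edges, ind = 3, [(0, 1), (1, 2)], {0, 2}
    ax, rules = grammar_from_independent_set(n, edges, ind)
    k, m = len(ind), len(edges)
    print(grammar_size(ax, rules), 4 * n * n - 2 * m + 3 * k - 2 * k * n)
```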

Lemma 12 directly yields the following result:

Theorem 11

1-SGP and SGP parameterised by |N| are W[1]-hard.

We emphasise that Theorem 11 shows W[1]-hardness for the smallest grammar problem parameterised by |N| only for the case where the terminal alphabet Σ is unbounded. The most important respective question, which, unfortunately, is left open here, is whether the smallest grammar problem is fixed-parameter tractable with respect to the combined parameter (|N|,|Σ|) (we discuss the open cases of the parameterised complexity of the smallest grammar problem in more detail in Section 6).

Finally, we note that we can use Lemma 11 in order to obtain a simple exact exponential-time algorithm for the smallest grammar problem. More precisely, we compute for each subset $F \subseteq \mathsf {F}_{\geq 2}(w)$ a smallest F-grammar, which yields an algorithm with an overall running-time of $2^{\mathcal {O}(|{w}|^{2})}$. In the next section, we present more advanced exact exponential-time algorithms for SGP and 1-SGP.

5 Exact Exponential-Time Algorithms

An obvious approach for an exact exponential-time algorithm for SGP is to enumerate all ordered trees with |w| leaves and to interpret them as derivation trees of a grammar for w. More precisely, for a given ordered tree with |w| leaves, we first label the leaves with the symbols of w and then we inductively label each internal node with u₁u₂…u_k, where u_i are the labels of its children nodes. Finally, for every factor u that occurs as a label of some internal node, we substitute all occurrences of this label by a nonterminal A_u. In order to estimate the number of such trees, we first note that the i^th Catalan number C_i is the number of full binary trees (i. e., every non-leaf has exactly two children) with i + 1 leaves. Moreover, every tree with |w| leaves can be obtained from a full binary tree with |w| leaves by contracting some of its ‘non-leaf’ edges (i. e., edges not incident to a leaf). Since every full binary tree with |w| leaves has fewer than |w| such ‘non-leaf’ edges, the number of trees that we have to consider is at most $C_{|{w}|-1} \cdot 2^{|{w}|}$. Since $C_{|{w}|-1} \in \mathcal {O}(4^{|{w}|-1})$, this leads to an algorithm with running-time $\mathcal {O}^{*}(8^{|{w}|})$.
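As a quick numerical sanity check of this estimate, the following sketch evaluates the bound $C_{|{w}|-1} \cdot 2^{|{w}|}$ using the closed form $C_i = \binom{2i}{i}/(i+1)$ for the Catalan numbers; the function names are illustrative.

```python
from math import comb


def catalan(i):
    """i-th Catalan number, C_i = binom(2i, i) / (i + 1)."""
    return comb(2 * i, i) // (i + 1)


def tree_count_bound(n):
    """Upper bound C_{n-1} * 2^n on the number of trees with n leaves to consider."""
    return catalan(n - 1) * 2 ** n


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, tree_count_bound(n), 8 ** n)
```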

In the following, we shall give more sophisticated exact exponential-time algorithms with running times $\mathcal {O}^{*}(1.8392^{|{w}|})$, for the 1-level case, and $\mathcal {O}^{*}(3^{|{w}|})$, for the multi-level case. First, we need to introduce some helpful notations.

Let G = (N,Σ, R, ax) be a grammar for w and let α = A₁…A_k, A_i ∈ (Σ ∪ N), 1 ≤ i ≤ k. The factorisation of $\mathfrak {D}(\alpha )$ induced by α is the tuple $(\mathfrak {D}_{G}(A_{1}),\dots ,\mathfrak {D}_{G}(A_{k}))$. Furthermore, the factorisation of w induced by ax is called the factorisation of w induced by G. A factorisation q = (u₁, u₂,…, u_k) of a word w with |w| = n can be characterised by the vector $v_{q}\in \{0,1\}^{n-1}$ defined by setting v_q[i] = 1 if and only if i = |u₁…u_j| for some 1 ≤ j < k. For the sake of convenience, we implicitly assume v_q[0] = v_q[n] = 1, and treat vectors as words over the alphabet $\mathbb {N}$, which allows us to use notations already defined for words. From now on, we shall use these two representations of factorisations, i. e., tuples of factors and vectors in {0,1}^n− 1, interchangeably, without mentioning it.
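The two representations of a factorisation can be converted into each other as in the following sketch. Vectors are stored as Python tuples whose entry at index i − 1 corresponds to v_q[i]; the implicit entries v_q[0] = v_q[n] = 1 are not stored. The function names are illustrative.

```python
def to_vector(factors):
    """Map a factorisation (u_1, ..., u_k) of w to its vector in {0,1}^(n-1)."""
    n = sum(len(u) for u in factors)
    cuts, pos = set(), 0
    for u in factors[:-1]:
        pos += len(u)
        cuts.add(pos)                # pos = |u_1 ... u_j| for some j < k
    return tuple(1 if i in cuts else 0 for i in range(1, n))


def to_factors(w, v):
    """Map a vector in {0,1}^(n-1) back to the tuple of factors of w."""
    out, start = [], 0
    for i, bit in enumerate(v, start=1):
        if bit:
            out.append(w[start:i])
            start = i
    out.append(w[start:])
    return tuple(out)


if __name__ == "__main__":
    q = ("ab", "c", "ab")
    v = to_vector(q)
    print(v, to_factors("abcab", v))
```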

5.1 The 1-Level Case

In the 1-level case, as long as we are only concerned with smallest grammars, the factorisation induced by the axiom already fully determines the grammar. More formally, let $q=(u_{1},u_{2},\dots ,u_{k})$ be a factorisation for a word w and let F_q = {u_i∣1 ≤ i ≤ k,|u_i|≥ 2}. We define the 1-level grammar G_q = (N_q,Σ, R_q, ax_q) by R_q = {(A_u, u): u ∈ F_q}, N_q = {A_u: u ∈ F_q} and ax_q = B₁…B_k with $B_{j} = A_{u_{j}}$, if u_j ∈ F_q and B_j = u_j, otherwise.

Lemma 13

For any factorisation $q=(u_{1},u_{2},\dots ,u_{k})$ for w, G_q is a smallest grammar among all 1-level grammars for w that induce the factorisation q.

Proof

Let $q=(u_{1},u_{2},\dots ,u_{k})$ be a factorisation for a word w. Every 1-level grammar G = (N,Σ, R, ax) for w that induces q satisfies $|G| = k + {\sum }_{A \in N} |\mathfrak {D}(A)| \geq k + {\sum }_{u \in F_{q}} |u|$. Since $|G_{q}| = k + {\sum }_{u \in F_{q}} |u|$, G_q is a smallest 1-level grammar for w that induces q. □
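The grammar G_q and its size can be computed directly from q. The sketch below also includes a plain exhaustive search over all 2^(|w|−1) factorisations, which is only meant for very short words; nonterminals are named `"A_" + u` for illustration, and all function names are hypothetical.

```python
from itertools import product


def split(w, v):
    """Factorisation of w described by the 0/1 vector v of length |w| - 1."""
    out, start = [], 0
    for i, bit in enumerate(v, start=1):
        if bit:
            out.append(w[start:i])
            start = i
    out.append(w[start:])
    return tuple(out)


def grammar_from_factorisation(q):
    """Return (axiom, rules) of G_q: one rule A_u -> u per distinct factor u with |u| >= 2."""
    rules = {"A_" + u: u for u in q if len(u) >= 2}
    axiom = ["A_" + u if len(u) >= 2 else u for u in q]
    return axiom, rules


def grammar_size(axiom, rules):
    """|G| = |ax| + sum of the lengths of all right sides."""
    return len(axiom) + sum(len(rhs) for rhs in rules.values())


def smallest_one_level(w):
    """Exhaustive search over all factorisations of w (exponential in |w|)."""
    best = None
    for v in product((0, 1), repeat=len(w) - 1):
        g = grammar_from_factorisation(split(w, v))
        if best is None or grammar_size(*g) < grammar_size(*best):
            best = g
    return best


if __name__ == "__main__":
    print(grammar_size(*grammar_from_factorisation(("ab", "c", "ab"))))
    print(smallest_one_level("abababab"))
```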

Choosing the smallest among all grammars {G_q∣q is a factorisation of w} yields an $\mathcal {O}^{*}(2^{n})$ algorithm for 1-SGP. However, it is not necessary to enumerate factorisations that contain at least two consecutive factors of length 1, which improves this result as follows.

Theorem 12

1-SGP can be solved exactly in polynomial space and in time $\mathcal {O}^{*}(1.8392^{|{w}|})$.

Proof

For any $k \in \mathbb {N}$, let Γ_k contain all q ∈{0,1}^k, such that q has no prefix 11, no suffix 11 and no factor 111; furthermore, let ${\Gamma }^{\prime }_{k}$ contain all q ∈{0,1}^k, such that q has no suffix 11 and no factor 111. Clearly, Γ_|w|− 1 contains exactly the factorisations for w that have no consecutive factors of length 1. In order to solve the smallest 1-level grammar problem, we enumerate Γ_|w|− 1 and for every q ∈Γ_|w|− 1, we construct G_p, where p is obtained from q, by replacing every non-repeated factor u of q with the factors u[1], u[2],…, u[|u|]. It remains to prove the correctness of this algorithm and to estimate its running-time.

To this end, let G be a smallest 1-level grammar for w and let p = (u₁, u₂,…, u_k) be the factorisation induced by G. Furthermore, let q be the factorisation obtained from p by joining any maximal sequence u_i, u_i+ 1,…, u_j, 1 ≤ i < j ≤ k, of factors with |u_ℓ| = 1, i ≤ ℓ ≤ j (note that q ∈Γ_|w|− 1). If none of the newly constructed factors of q is repeated, then the algorithm, when enumerating q, constructs grammar G_p that, according to Lemma 13, is smallest among all 1-level grammars for w that induce p; thus, G_p is a smallest 1-level grammar. If, on the other hand, any of these newly constructed factors is repeated and has a length of at least 3, or has length 2 and is repeated at least 3 times, then a 1-level grammar smaller than G could be constructed, which is a contradiction. This leaves the case where all repeated newly constructed factors of q have length 2 and are repeated exactly twice. In this case the algorithm will, when enumerating q, construct a grammar that differs from G_p only in that it compresses some factors of length 2 that are repeated only twice, and that G_p does not compress. This grammar obviously has the same size as G_p and is therefore a smallest 1-level grammar as well.

In order to estimate the running-time, let T(k) = |Γ_k| and $T^{\prime }(k) = |{\Gamma }^{\prime }_{k}|$, for every $k \in \mathbb {N}$. Obviously,

$$ T(k) = |\{q \in {\Gamma}_{k}:q[1] = 0\}| + |\{q \in {\Gamma}_{k}:q[1] = 1\}| , $$

so, in the following, we shall determine |{q ∈Γ_k∣q[1] = 0}| and |{q ∈Γ_k∣q[1] = 1}| separately. To this end, we first note that $|\{q \in {\Gamma }_{k} \mid q[1] = 1\}| = T(k-1) - T^{\prime }(k-3)$ (this is due to the fact that T(k − 1) also counts all $q = 110q^{\prime }\ldots $ with $q^{\prime } \in {\Gamma }^{\prime }_{k-3}$, so we have to subtract $T^{\prime }(k-3)$). Moreover,

$$ \begin{array}{@{}rcl@{}} |\{q \in {\Gamma}_{k} \mid q[1]q[2] = 01\}| &=& T(k-2) ,\\ |\{q \in {\Gamma}_{k} \mid q[1]q[2]q[3] = 001\}| &=& T(k-3) ,\\ |\{q \in {\Gamma}_{k} \mid q[1]q[2]q[3] = 000\}| &=& T^{\prime}(k-3) . \end{array} $$

This is due to the fact that extending the prefix 01 or 001 with 11 yields a factor 111, whereas the prefix 000 can be extended by 11. With the above observations, we can now conclude the following:

$$ \begin{array}{@{}rcl@{}} T(k) &= &|\{q \in {\Gamma}_{k} \mid q[1] = 0\}| + |\{q \in {\Gamma}_{k} \mid q[1] = 1\}|\\ &= &|\{q \in {\Gamma}_{k} \mid q = 01\ldots\}| + |\{q \in {\Gamma}_{k} \mid q = 001\ldots\}| +\\ &&|\{q \in {\Gamma}_{k} \mid q = 000\ldots\}| + |\{q \in {\Gamma}_{k} \mid q[1] = 1\}|\\ &= &T(k-2) + T(k-3) + T^{\prime}(k-3) + T(k-1) - T^{\prime}(k-3)\\ &= &T(k-1) + T(k-2) + T(k-3) . \end{array} $$

This yields $T(k) = \mathcal {O}(1.8392^{k})$; since we can also enumerate Γ_|w|− 1 in time $\mathcal {O}^{*}(1.8392^{|{w}|})$, the algorithm has a running-time of $\mathcal {O}^{*}(1.8392^{|{w}|})$. □
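The counting argument can be cross-checked by brute force for small k. The sketch below enumerates Γ_k directly from its definition and tests the recurrence T(k) = T(k − 1) + T(k − 2) + T(k − 3); the check starts at k = 5, because for very short vectors the prefix and suffix conditions overlap and the first values behave slightly differently. The function names are illustrative.

```python
from itertools import product


def in_gamma(q):
    """q in Gamma_k: no prefix 11, no suffix 11, no factor 111."""
    s = "".join(map(str, q))
    return not s.startswith("11") and not s.endswith("11") and "111" not in s


def T(k):
    """|Gamma_k|, computed by exhaustive enumeration of {0,1}^k."""
    return sum(1 for q in product((0, 1), repeat=k) if in_gamma(q))


if __name__ == "__main__":
    values = [T(k) for k in range(1, 14)]
    print(values)
    print(values[-1] / values[-2])   # ratio of consecutive values
```

The ratio of consecutive values approaches the real root of x³ = x² + x + 1, which is approximately 1.8393 and explains the base of the exponential bound.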

5.2 The Multi-Level Case

The obvious idea for a dynamic programming algorithm is to build up grammars level by level, e. g., by starting with a 1-level grammar, then extending it by a new axiom, which can derive the old axiom in one derivation step, and iterating this procedure. Obviously, we have to try an exponential number of axioms, which will lead to an exponential-time algorithm (as suggested by the NP-completeness of the problem). However, there is a more fundamental problem with this general approach, which shall be pointed out by going a bit more into detail.

For every i and every factorisation p of w, we store in entry T[i, p] of a table the size of a smallest i-level grammar with an axiom ax that induces factorisation p (in the sense defined at the beginning of this section). Then, for every factorisation q, such that p is a refinement of q, we construct a new axiom $\mathsf {ax}^{\prime }$ that induces factorisation q and that can derive ax in one step, which is treated as the axiom of a new (i + 1)-level grammar. We subtract the profit of the rules needed to derive ax from $\mathsf {ax}^{\prime }$ from T[i, p] and store the obtained number in T[i + 1, q]. Note that the axioms ax and $\mathsf {ax}^{\prime }$ are fully determined by the factorisations p and q (similarly to how a factorisation determines a smallest 1-level grammar with an axiom inducing this factorisation, see Lemma 13). However, this approach is fundamentally flawed, since in order to compute the size of the new (i + 1)-level grammar, we need to know whether the rules needed to derive ax from $\mathsf {ax}^{\prime }$ have already been used earlier in the i-level grammar and therefore are already counted by T[i, p], or whether they are newly introduced. On the other hand, it should clearly be avoided to additionally store all previously used rules as well.

To overcome this problem, we do not consider the levels of a grammar as strings ax, D(ax), D(D(ax)),…, w, which is the obvious choice, but we define them in such a way that all occurrences of a nonterminal are on the same level. With this definition, all the rules that are needed for the extension to the new level must be completely new rules without prior application; thus, a dynamic programming approach similar to the one described above will be successful. Next, we give the required definitions (which are also illustrated by Example 2).

For a d-level grammar G = (N,Σ, R, ax), we partition the set of nonterminals N according to the number of derivation steps that are necessary to derive a terminal word (or, equivalently, according to their height, i. e., the maximum distance to a leaf in the derivation tree). More precisely, let $N_{1},\dots ,N_{d}$ be the partition of N into $N_{i}=\{A\in N \mid ({\mathsf {D}^{i}_{G}}(A)\in {\Sigma }^{+})\wedge (\mathsf {D}^{i-1}_{G}(A)\notin {\Sigma }^{+})\}$. We recall that the morphism D : (N ∪Σ)^∗→ (N ∪Σ)^∗ replaces every occurrence of a nonterminal by the right side of its rule. For every i, 1 ≤ i ≤ d, we modify D, such that it only considers nonterminals from N_i and ignores the rest. More formally, for every i, 1 ≤ i ≤ d, we define a morphism $\widehat {\mathsf {D}}_{i}\colon (N\cup {\Sigma })^{*}\rightarrow (N\cup {\Sigma })^{*}$ component-wise by $\widehat {\mathsf {D}}_{i}(x) = \mathsf {D}(x)$, if x ∈ N_i and $\widehat {\mathsf {D}}_{i}(x) = x$, otherwise. Using these morphisms, we now inductively define the levels L_i, 0 ≤ i ≤ d, of G by L_d = ax and, for every i, 0 ≤ i ≤ d − 1, $\mathsf {L}_{i} = \widehat {\mathsf {D}}_{i+1}(\mathsf {L}_{i + 1})$.

Observation 4

The sequence L_d, L_d− 1,…, L₁, L₀ is a derivation with L_d = ax, L₀ = w and, by a simple induction over i, it can be verified that, for every i, 1 ≤ i ≤ d, all applications of rules for nonterminals from N_i happen in the single derivation step from L_i to L_i− 1. In particular, this implies that, for every i, 1 ≤ i ≤ d, L_i contains all occurrences of nonterminals A ∈ N_i that are ever derived in the derivation of w or, in other words, for every j, 0 ≤ j ≤ i − 1, ${\sum }_{A \in N_{i}} |\mathsf {L}_{j}|_{A} = 0$.

Since in the derivation L_d, L_d− 1,…, L₁, L₀ occurrences of a nonterminal A are not derived until all of them are collected in L_i and then they are derived all at once in the same derivation step, we can conveniently define the term profit for all rules (of the d-level grammar G) as follows. For every i, 1 ≤ i ≤ d, we define the profit of every A ∈ N_i by p(A) = |L_i|_A(|D(A)|− 1) −|D(A)|. Note that for d = 1 this corresponds to the definition of profit for 1-level grammars as introduced on page 11. In particular, we can now express the size of a grammar in terms of the profit of its rules:

Proposition 7

Let G be a grammar. Then $|G| = |{w}|-({\sum }^{d}_{i = 1}{\sum }_{A\in N_{i}} \mathsf {p}(A))$.

Proof

We recall that, by definition of the size of a grammar and as a conclusion of Observation 4, we have

$$ \begin{array}{@{}rcl@{}} |G| = \left( \sum\limits_{i = 1}^{d} \sum\limits_{A \in N_{i}} |\mathsf{D}(A)| \right) + |\mathsf{ax}| , |{w}| = \left( \sum\limits_{i = 1}^{d} \sum\limits_{A \in N_{i}} |\mathsf{L}_{i}|_{A} (|\mathsf{D}(A)| - 1)\right) + |\mathsf{ax}| . \end{array} $$

Consequently,

$$ \begin{array}{@{}rcl@{}} &&|{w}|-\left( \sum\limits_{i = 1}^{d}\sum\limits_{A\in N_{i}}\! \mathsf{p}(A)\! \right)\! = |{w}| - \left( \sum\limits_{i = 1}^{d} \sum\limits_{A \in N_{i}} |\mathsf{L}_{i}|_{A} (|\mathsf{D}(A)| - 1) - |\mathsf{D}(A)|\right) = \\ &&|{w}| - \left( \left( \sum\limits_{i = 1}^{d} \sum\limits_{A \in N_{i}} |\mathsf{L}_{i}|_{A} (|\mathsf{D}(A)| - 1)\right) - \left( \sum\limits_{i = 1}^{d} \sum\limits_{A \in N_{i}} |\mathsf{D}(A)|\right)\right) = \\ &&|{w}| - \left( (|{w}| - |\mathsf{ax}|) - (|G| - |\mathsf{ax}|)\right) = |G| . \end{array} $$

□

Example 2

Let G = (N,Σ, R, ax) with N = {A, B, C, D}, Σ = {a, b}, R = {A → Dbb, B → ab, C → AB, D → aaa} and ax = CDC be the 3-level grammar illustrated in Fig. 4. According to the definitions from above, the partition of N is N₁ = {B, D}, N₂ = {A}, N₃ = {C}, and the levels are

$$ \begin{array}{@{}rcl@{}} \begin{array}{ll} \mathsf{L}_{3}= \mathsf{ax} & \qquad =CDC ,\\ \mathsf{L}_{2} = \widehat{\mathsf{D}}_{3}(CDC) & \qquad = A B D A B ,\\ \mathsf{L}_{1} = \widehat{\mathsf{D}}_{2}(A B D A B) & \qquad = D {\mathtt{b}} {\mathtt{b}} B D D {\mathtt{b}} {\mathtt{b}} B ,\\ \mathsf{L}_{0} = \widehat{\mathsf{D}}_{1}(D {\mathtt{b}} {\mathtt{b}} B D D {\mathtt{b}} {\mathtt{b}} B) & \qquad = {\mathtt{a}} {\mathtt{a}} {\mathtt{a}} {\mathtt{b}} {\mathtt{b}} {\mathtt{a}} {\mathtt{b}} {\mathtt{a}} {\mathtt{a}} {\mathtt{a}} {\mathtt{a}} {\mathtt{a}} {\mathtt{a}} {\mathtt{b}} {\mathtt{b}} {\mathtt{a}} {\mathtt{b}} . \end{array} \end{array} $$

Note that, for every i, 1 ≤ i ≤ 3, L_i contains all occurrences of all nonterminals from N_i and the rules for all nonterminals N_i are exclusively applied in deriving L_i− 1 from L_i. In particular, note that in the derivation L₃,…, L₀, the derivation of occurrences of nonterminals B and D is delayed until the very last derivation step.

Furthermore, the profits are as follows

$$ \begin{array}{@{}rcl@{}} \mathsf{p}(A) &=& |\mathsf{L}_{2}|_{A} (|\mathsf{D}(A)| - 1) - |\mathsf{D}(A)| = 2 (3 - 1) - 3 = 1, \\ \mathsf{p}(B) &=& |\mathsf{L}_{1}|_{B} (|\mathsf{D}(B)| - 1) - |\mathsf{D}(B)| = 2 (2 - 1) - 2 = 0, \\ \mathsf{p}(C) &=& |\mathsf{L}_{3}|_{C} (|\mathsf{D}(C)| - 1) - |\mathsf{D}(C)| = 2 (2 - 1) - 2 = 0, \\ \mathsf{p}(D) &=& |\mathsf{L}_{1}|_{D} (|\mathsf{D}(D)| - 1) - |\mathsf{D}(D)| = 3 (3 - 1) - 3 = 3 . \end{array} $$

Moreover, $|{w}|-{\sum }_{A\in N} \mathsf {p}(A) = 17 - 4 = 13$ and |G| = |ax| + |D(A)| + |D(B)| + |D(C)| + |D(D)| = 3 + 3 + 2 + 2 + 3 = 13.
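The computations of this example can be reproduced mechanically. The sketch below uses the rules A → Dbb, B → ab, C → AB and D → aaa, as they can be read off from the levels displayed above, represents all symbols as single characters, and computes the heights, the levels L_i and the profits; the function names are illustrative.

```python
RULES = {"A": "Dbb", "B": "ab", "C": "AB", "D": "aaa"}
AXIOM = "CDC"


def height(sym, rules):
    """0 for terminals; otherwise the index i with sym in N_i."""
    if sym not in rules:
        return 0
    return 1 + max(height(x, rules) for x in rules[sym])


def levels(axiom, rules):
    """Return {i: L_i}; L_{i-1} is obtained from L_i by expanding only N_i."""
    d = max(height(x, rules) for x in axiom)
    out = {d: axiom}
    for i in range(d, 0, -1):
        out[i - 1] = "".join(
            rules[x] if height(x, rules) == i else x for x in out[i]
        )
    return out


def profits(axiom, rules):
    """p(A) = |L_i|_A * (|D(A)| - 1) - |D(A)| for A in N_i."""
    lv = levels(axiom, rules)
    return {
        a: lv[height(a, rules)].count(a) * (len(rhs) - 1) - len(rhs)
        for a, rhs in rules.items()
    }


if __name__ == "__main__":
    lv = levels(AXIOM, RULES)
    p = profits(AXIOM, RULES)
    size = len(AXIOM) + sum(len(rhs) for rhs in RULES.values())
    print(lv, p, size, len(lv[0]) - sum(p.values()))
```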

Before we formally present the dynamic programming algorithm, we sketch its behaviour in a more intuitive way. We first need the following definition. A factorisation p = (u₁, u₂,…, u_k) is a refinement of a factorisation q = (v₁, v₂,…, v_m), denoted by p ≼ q, if $(u_{j_{i - 1} + 1}, u_{j_{i - 1} + 2}, \ldots , u_{j_{i}})$ is a factorisation of v_i, 1 ≤ i ≤ m, for some {j_i}_0≤i≤m, with 0 = j₀ < j₁ < … < j_m = k.

The algorithm runs through steps $i = 1, 2, \ldots , \frac {|{w}|}{2}$ and in step i, it considers all possibilities for two factorisations q_i− 1 and q_i of w induced by L_i− 1 and L_i, respectively (note that this implies q_i− 1 ≼ q_i). The differences between q_i− 1 and q_i implicitly define N_i as follows. Let $q_{i}=(v_{1},v_{2},\dots ,v_{k})$ and let q_i− 1 = (u₁, u₂,…, u_ℓ), which, since q_i− 1 ≼ q_i, means that for some j_t, 0 ≤ t ≤ k, with 1 = j₀ < j₁ < … < j_k = ℓ + 1, $(u_{j_{t - 1}}, u_{j_{t - 1} + 1}, \ldots , u_{j_{t} - 1})$ is a factorisation of v_t, 1 ≤ t ≤ k. If j_s − j_s− 1 > 1 for some 1 ≤ s ≤ k, N_i contains a nonterminal A with |D(A)| = j_s − j_s− 1 and $\mathfrak {D}(A)=v_{s}$. The number |L_i|_A is also implicitly given by counting how often the sequence of factors $(u_{j_{s-1}},\dots ,u_{j_{s}-1})$ independently occurs in q_i− 1 and is combined into one single factor in q_i; more precisely, $|\mathsf {L}_{i}|_{A} = |\{t\colon (u_{j_{t-1}},\dots ,u_{j_{t}-1})=(u_{j_{s-1}},\dots ,u_{j_{s}-1})\}|$. This allows us to calculate the profit of the rule for A without knowing the exact structure of the rules for nonterminals in N_j with j≠i. By Lemma 13, this choice of nonterminals for N_i is optimal for the fixed induced factorisations, which means that a search among all choices for q_i− 1 and q_i yields a smallest i-level grammar for w. The running time of this algorithm is dominated by enumerating all pairs q_i− 1 and q_i of factorisations of w. However, due to q_i− 1 ≼ q_i, these pairs can be encoded as vectors from {0,1,2}^|w|− 1 (the entries denote whether the corresponding position in w is factorised by both (entry ‘1’), only by the refinement (entry ‘2’) or by neither (entry ‘0’) of the factorisations). Hence, enumerating these pairs of vectors can be done in time $\mathcal {O}(3^{|{w}|})$.

Theorem 13

SGP can be solved in time and space $\mathcal {O}^{*}(3^{|{w}|})$.

Proof

Let n = |w|. We use dynamic programming to consider all possible factorisations of w and refinements for each level $i=1,\dots ,d$. A factorisation of w is stored as a vector q ∈{0,1}^n− 1 and, furthermore, we use vectors q ∈{0,1,2}^n− 1 in order to represent a factorisation together with a refinement, as explained above (for the sake of convenience, we implicitly assume q[0] = q[n] = 1). For such a vector q ∈{0,1,2}^n− 1 that describes two factorisations p and $p^{\prime }$ with $p \preceq p^{\prime }$, we denote by F(q) the factorisation $p^{\prime }$ (represented as a vector from {0,1}^n− 1) and by R(q) the refinement p (represented as a vector from {0,1}^n− 1). More formally, let $F \colon \{0,1,2\}^{n-1}\rightarrow \{0,1\}^{n-1}$ be a mapping that replaces each ‘2’-entry by a ‘0’-entry (and leaves all other entries unchanged), and let $R \colon \{0,1,2\}^{n-1}\rightarrow \{0,1\}^{n-1}$ be a mapping that replaces each ‘2’-entry by a ‘1’-entry (and leaves all other entries unchanged).
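The encoding of a factorisation together with a refinement can be illustrated as follows. Vectors are stored as tuples whose entry at index j − 1 corresponds to q[j], and the implicit entries q[0] = q[n] = 1 are not stored; the helper `split` is introduced here only to display the factorisations.

```python
from itertools import product


def F(q):
    """Coarse factorisation: every '2'-entry becomes '0'."""
    return tuple(0 if e == 2 else e for e in q)


def R(q):
    """Refinement: every '2'-entry becomes '1'."""
    return tuple(1 if e == 2 else e for e in q)


def split(w, v):
    """Factors of w described by the 0/1 vector v of length |w| - 1."""
    out, start = [], 0
    for i, bit in enumerate(v, start=1):
        if bit:
            out.append(w[start:i])
            start = i
    out.append(w[start:])
    return out


if __name__ == "__main__":
    w, q = "abcab", (2, 1, 0, 2)
    print(split(w, F(q)), split(w, R(q)))
    # Number of vectors in {0,1,2}^(n-1) for n = 5.
    print(sum(1 for _ in product((0, 1, 2), repeat=len(w) - 1)))
```

For q = (2, 1, 0, 2), the factorisation (ab, cab) is refined to (a, b, ca, b).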

The dynamic program uses the following tables:

T[i, q] for $i\in \{2,\dots , \frac {n}{2}\}$ and all q ∈{0,1,2}^n− 1 ∖{0,1}^n− 1 stores the size of a smallest i-level grammar for w for which the axiom ax induces the factorisation F(q) and for which $\widehat {\mathsf {D}}_{i}(\mathsf {ax})$ induces the factorisation R(q).
S[i, q] for all $i\in \{1,\dots ,\frac {n}{2}\}$ and all q ∈{0,1}^n− 1 stores the size of a smallest i-level grammar for w for which the axiom induces the factorisation q.
P[i, q] for all $i\in \{2,\dots , \frac {n}{2}\}$ and all q ∈{0,1}^n− 1 stores the refinement of q which equals the factorisation induced by $\widehat {\mathsf {D}}_{i}(\mathsf {ax})$ for an optimal i-level grammar for which ax induces factorisation q.
opt_i for all $i\in \{1,\dots ,\frac {n}{2}\}$ stores the value of a smallest i-level grammar for w.

We point out that the tables T and S are sufficient to compute the size of a smallest grammar; the purpose of table P is to construct an actual grammar of minimal size after termination of the algorithm. Intuitively speaking, in order to determine S[i, q], i. e., the size of a smallest i-level grammar for which the axiom induces the factorisation q, we have to check all entries $T[i, q^{\prime }]$ for which the factorisation of $q^{\prime }$ (note that $q^{\prime }$ represents a factorisation and a refinement) equals q and for a minimal one of these entries, we store the actual refinement (which is not needed anymore to compute the size of a minimal grammar) in P[i, q]. In this way, the entries of P[i, q] allow us to restore an actual smallest grammar.

We first initialise S by setting S[1, q] = |G_q|, for every q ∈{0,1}^n− 1, where, according to Lemma 13, G_q is a smallest 1-level grammar for w that induces factorisation q, and we set $opt_{1} = \min \limits \{S[1,q]\mid q\in \{0,1\}^{n-1}\}$.

We then compute iteratively for each $i=2,\dots , \frac {n}{2}$ the entries T[i, q], $S[i, q^{\prime }]$ and $P[i, q^{\prime }]$, for every q ∈{0,1,2}^n− 1 ∖{0,1}^n− 1 and $q^{\prime }\in \{0,1\}^{n-1}$ as follows.

First, for any q ∈{0,1,2}^n− 1 ∖{0,1}^n− 1, we define the set I(q) of consecutive factors in R(q) which are combined into one factor in F(q):

$$ \begin{array}{@{}rcl@{}} I(q):=\{(j_{0},j_{1},\dots,j_{k}) \mid & & |q[j_{0} - 1..j_{k}]|_{1} = |q[j_{0} - 1]q[j_{k}]|_{1} = 2,\\ && |q[j_{0}..j_{k}]|_{2} = |q[j_{1}]{\ldots} q[j_{k-1}]|_{2} = k-1 \geq 1\} . \end{array} $$

Furthermore, from I(q), we can extract the set N(q) of nonterminals which create these factors on level i, i. e., $N(q):=\{w(j_{0},j_{1},\dots ,j_{k})\mid (j_{0},\dots ,j_{k})\in I(q)\}$, where

$$ \begin{array}{@{}rcl@{}} w(j_{0},j_{1},\dots,j_{k}):=(w[j_{0}+1.. j_{1}],w[j_{1}+1.. j_{2}],\dots,w[j_{k-1}+1 .. j_{k}]) . \end{array} $$

The corresponding number of occurrences of the nonterminal $w(j_{0},j_{1},\dots ,j_{k})$ on level i is given by

$$c(j_{0},j_{1},\dots,j_{k}) := |\{(j^{\prime}_{0},j^{\prime}_{1},\dots,j^{\prime}_{k}) \in I(q) \mid w(j_{0},j_{1},\dots,j_{k}) = w(j^{\prime}_{0},j^{\prime}_{1},\dots,j^{\prime}_{k})\}| .$$

The entry T[i, q] can now be computed as follows:

$$T[i,q]=S[i-1,R(q)]-\sum\limits_{w(j_{0},j_{1},\dots,j_{k})\in N(q)} \bigl(c(j_{0},j_{1},\dots,j_{k})(k-1)-k\bigr)$$

Then, for every $q^{\prime }\in \{0,1\}^{n-1}$, we can compute entries $S[i,q^{\prime }]$ and $P[i,q^{\prime }]$ by

$$ \begin{array}{@{}rcl@{}} S[i,q^{\prime}] &=&\min\{T[i, q]\mid F(q)=q^{\prime}\} \text{ and}\\ P[i,q^{\prime}] &=& q , \end{array} $$

where $q \in \{0,1,2\}^{n-1} \setminus \{0,1\}^{n-1}$ with $F(q)=q^{\prime }$ and $T[i,q]=S[i,q^{\prime }]$. Finally, the value $opt_{i}$ is computed by $opt_{i} = \min \limits \{S[i,q^{\prime }]\mid q^{\prime }\in \{0,1\}^{n-1}\}$.
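The computation of one level of the tables can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation, and it rests on labelled assumptions: R(q) is read as replacing every 2-entry by 1, F(q) as replacing every 2-entry by 0 (which matches the description of q in the correctness proof), the previous level S[i − 1,⋅] is supplied as a dictionary `s_prev`, and a missing dictionary key stands for an empty minimum. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def refinement(q):
    """R(q): every 2-entry becomes a cut (assumed reading of R)."""
    return tuple(1 if entry == 2 else entry for entry in q)


def factorisation(q):
    """F(q): every 2-entry is dropped (assumed reading of F)."""
    return tuple(0 if entry == 2 else entry for entry in q)


def rule_counts(w, q):
    """Map each element of N(q) to its number of occurrences c."""
    counts = Counter()
    cuts = [0]                                   # border 0 acts as a 1-entry
    for j in range(1, len(w) + 1):
        entry = 1 if j == len(w) else q[j - 1]   # border n acts as a 1-entry
        if entry == 0:
            continue
        cuts.append(j)
        if entry == 1:
            if len(cuts) >= 3:                   # the group contains a 2-entry
                counts[tuple(w[a:b] for a, b in zip(cuts, cuts[1:]))] += 1
            cuts = [j]
    return counts


def dp_layer(w, s_prev):
    """Compute S[i, .] and P[i, .] from s_prev = S[i-1, .]."""
    s_cur, p_cur = {}, {}
    for q in product((0, 1, 2), repeat=len(w) - 1):
        if 2 not in q:
            continue
        # Sum of c * (k - 1) - k over all rules in N(q).
        saving = sum(c * (len(rule) - 1) - len(rule)
                     for rule, c in rule_counts(w, q).items())
        t = s_prev.get(refinement(q), math.inf) - saving   # entry T[i, q]
        key = factorisation(q)
        if t < s_cur.get(key, math.inf):
            s_cur[key], p_cur[key] = t, q
    return s_cur, p_cur
```

Calling `dp_layer` repeatedly, starting from the table S[1,⋅] of Lemma 13, would fill the levels one after the other; the initial table itself is not reproduced here.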

After termination of step $\frac {n}{2}$, the size of a smallest grammar for the word w is $\min \limits \{opt_{i} \mid 1 \leq i \leq \frac {n}{2}\}$. Since the values in T[i, q] for any $i=2,3,\dots , \frac {n}{2}$ and $q \in \{0,1,2\}^{n-1} \setminus \{0,1\}^{n-1}$ are constructively computed from S[i − 1, R(q)] by defining the rules in N(q), the set $\bigcup _{j=1}^{i} N(q_{j})$ with $q_{i} := q$ and $q_{j-1} := P[j, q_{j}]$ for $j=i-1,\dots ,1$ yields an i-level grammar for w of size T[i, q]. For the index i with $opt_{i} = \min \limits \{opt_{j} \mid 1 \leq j \leq \frac {n}{2}\}$ and a vector $q \in \{0,1,2\}^{n-1} \setminus \{0,1\}^{n-1}$ such that $opt_{i} = S[i, R(q)]$, this construction gives a smallest grammar for w.

In order to prove the correctness of the algorithm, we show for each $q \in \{0,1\}^{n-1}$, inductively for each $i=1,\dots ,\frac {n}{2}$, that S[i, q] equals the size of a smallest i-level grammar for w which induces the factorisation q. For i = 1 this is implied by Lemma 13. Assuming that this statement is true for some value i − 1, let $G_{i} = (N,{\Sigma }, R, \mathsf {ax})$ be a smallest i-level grammar for w with $i \leq \frac {n}{2}$. Let $q_{i}$ and $q_{i-1}$ be the vector-representations of the factorisations induced by $\mathsf {ax}$ and $\widehat {\mathsf {D}}_{i}(\mathsf {ax})$, respectively. The grammar $G_{i-1}:=(N\setminus N_{i},{\Sigma },R\setminus \{(A,\mathsf {D}(A))\mid A\in N_{i}\},\widehat {\mathsf {D}}_{i}(\mathsf {ax}))$ is an (i − 1)-level grammar for w with induced factorisation $q_{i-1}$ and the size of $G_{i-1}$ can be computed by $|G_{i}|+{\sum }_{A\in N_{i}} \mathsf {p}(A)$ and is at least $S[i-1, q_{i-1}]$ by the induction hypothesis. By definition of the profit, the term $|G_{i}|+{\sum }_{A\in N_{i}} \mathsf {p}(A)$ can be re-written to $|G_{i}|+|\widehat {\mathsf {D}}_{i}(\mathsf {ax})|-|\mathsf {ax}| -{\sum }_{A\in N_{i}}|\mathsf {D}(A)|$.

Let $q \in \{0,1,2\}^{n-1}$ be such that $F(q) = q_{i}$ and $R(q) = q_{i-1}$, i. e., for every j, 1 ≤ j ≤ n − 1, $q[j] = 2$, if $q_{i}[j] \neq q_{i-1}[j]$, and $q[j] = q_{i}[j]$, otherwise. The value T[i, q] is computed from $S[i-1, q_{i-1}]$ by subtracting

$$ \begin{array}{@{}rcl@{}} &&\sum\limits_{w(j_{0},j_{1},\dots,j_{k})\in N(q)} \bigl(c(j_{0},j_{1},\dots,j_{k})(k-1)-k\bigr) = \\ &&\left( \sum\limits_{(j_{0},\dots,j_{k})\in I(q)}(k-1)\right) - \left( \sum\limits_{w(j_{0},\dots,j_{k})\in N(q)}k\right) . \end{array} $$

Each 2-entry in q occurs in exactly one tuple in I(q), which, by definition of q, yields:

$$\sum\limits_{(j_{0},j_{1},\dots,j_{k})\in I(q)}(k-1)= {\sum}_{j=1}^{n-1}(q_{i-1}[j]-q_{i}[j]) = |\widehat{\mathsf{D}}_{i}(\mathsf{ax})|-|\mathsf{ax}| .$$

For each $w(j_{0},j_{1},\dots ,j_{k})\in N(q)$, the set $N_{i}$ contains a nonterminal A with $|\mathsf {D}(A)| = k$, which means that ${\sum }_{A\in N_{i}}|\mathsf {D}(A)|\geq {\sum }_{w(j_{0},j_{1},\dots ,j_{k})\in N(q)}k$; thus,

$$ \begin{array}{@{}rcl@{}} |G_{i}|&=&|G_{i-1}|-|\widehat{\mathsf{D}}_{i}(\mathsf{ax})|+|\mathsf{ax}| +\sum\limits_{A\in N_{i}}|\mathsf{D}(A)|\\ &\geq& S[i-1,q_{i-1}] - \sum\limits_{w(j_{0},j_{1},\dots,j_{k})\in N(q)} \bigl(c(j_{0},j_{1},\dots,j_{k})(k-1)-k\bigr)\\ &=&T[i,q]\geq S[i,F(q)]=S[i,q_{i}] . \end{array} $$

Consequently, the algorithm computes the size of a grammar for w that is smallest among all grammars for w with at most $\frac {n}{2}$ levels and since for any word w there always exists a smallest grammar with at most $\frac {|{w}|}{2}$ levels, we conclude that the described algorithm finds a smallest grammar for w. □

We conclude this section by pointing out some features of the algorithm of Theorem 13. First, note that the brute-force enumeration of all $q \in \{0,1,2\}^{n-1} \setminus \{0,1\}^{n-1}$, which dominates the running time, provides some possibilities for modifications. For example, if we only consider q such that at most 2 neighbouring factors of R(q) are combined in F(q) (there are far fewer such vectors than in the full set $\{0,1,2\}^{n-1} \setminus \{0,1\}^{n-1}$), then we automatically compute smallest grammars in Chomsky normal form (see Footnote 11). Moreover, for a fixed i and two $q_{1}, q_{2} \in \{0,1, 2\}^{n-1}\setminus \{0,1\}^{n-1}$, the computations that are necessary to compute $T[i, q_{1}]$ and $T[i, q_{2}]$ are independent of each other and only require the previously computed values S[i − 1,⋅] (an analogous observation can be made for the computation of the S[i,⋅] and P[i,⋅]). Hence, the brute-force enumeration of the $q \in \{0,1,2\}^{n-1} \setminus \{0,1\}^{n-1}$ and of the $q^{\prime } \in \{0,1\}^{n-1}$ can easily be done in parallel.
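The restricted enumeration mentioned above can be illustrated with a short Python sketch. It assumes that q is represented as a tuple over {0, 1, 2} and that "at most 2 neighbouring factors are combined" means that at most one 2-entry lies between two consecutive 1-entries (or borders); the function names are hypothetical. Since the vectors are generated independently of each other, the resulting stream could be handed to any parallel map.

```python
from itertools import product


def all_vectors(n):
    """Yield every q in {0,1,2}^(n-1) that contains at least one 2-entry."""
    for q in product((0, 1, 2), repeat=n - 1):
        if 2 in q:
            yield q


def combines_at_most_two(q):
    """True if every factor of F(q) is split into at most 2 factors of R(q)."""
    twos = 0
    for entry in q:
        if entry == 1:          # a cut of F(q) starts a new factor
            twos = 0
        elif entry == 2:
            twos += 1
            if twos > 1:        # three or more neighbouring factors combined
                return False
    return True


def restricted_vectors(n):
    """Yield only the vectors considered in the restricted enumeration."""
    return (q for q in all_vectors(n) if combines_at_most_two(q))
```

For n = 3, the full enumeration has 3² − 2² = 5 vectors, and the restriction removes only (2, 2).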

6 Conclusions

We conclude this work by discussing some important open problems and additional questions that are motivated by our results.

6.1 Small Alphabets

For hard problems on strings, we usually encounter the situation that either the problem becomes polynomial-time solvable for constant alphabets, or there is a hardness reduction that works for some constant alphabet, which, by simple encoding techniques, extends to binary alphabets as well. Moreover, the unary case is often trivially solvable in polynomial time, even if the problem becomes intractable for larger alphabets. However, the smallest grammar problem shows a drastically different behaviour: it is NP-hard for alphabets of size at least 17 (and therefore not polynomial-time solvable for all constant alphabets, unless P = NP), but whether it remains NP-hard for very small alphabets (in particular, for the binary or unary case) is still open. Thus, we consider the following as one of the most important open questions:

Open Problem 1

Is it possible to compute smallest grammars for binary alphabets in polynomial time?

We believe that answering this question in the negative might be rather difficult. In fact, the substantial effort that was necessary to prove Theorem 3 suggests that further strengthening our reduction to the case of binary alphabets is problematic. Thus, a completely different kind of reduction seems necessary. However, the main technical challenge seems to be the necessity to control the compression of factors that function as codewords for parts of the source problem of the reduction, and it is arguably difficult to think of reductions that somehow circumvent this issue.

On the other hand, it is not apparent how a small alphabet could help in order to efficiently compute smallest grammars and, if this is possible, it seems that deeper combinatorial insights with respect to grammar-based compression are necessary.

6.2 Approximation

So far, no constant-factor approximation algorithm is known for the smallest grammar problem (as already mentioned in Section 1.3, the best approximation algorithms achieve a ratio in $\mathcal {O}\left (\log \left (\frac {|{w}|}{m^{*}}\right )\right )$ [33, 34, 40]) and, although not backed by any hardness results, the existing literature suggests that no such algorithm exists. Moreover, this apparent hardness of approximating smallest grammars also applies to the case of fixed alphabets, since, as shown in [39], if there is an approximation algorithm for the smallest grammar problem over a binary alphabet with a constant approximation ratio c, then there also is a 6c-approximation algorithm for arbitrary alphabets. In particular, this means that disproving the existence of a 6-approximation for the smallest grammar problem for unbounded alphabets, under some complexity-theoretic assumption, implies, under the same assumption, that there is no polynomial-time algorithm for the restriction to binary alphabets. Considering the substantial effort that went into designing a reduction for alphabet size 17 in this paper, such an inapproximability result for unbounded alphabets might actually be an easier way to show computational lower bounds for binary alphabets.

Aside from these consequences for binary alphabets, an inapproximability result (with some ratio significantly larger than the current bound of $\frac {8569}{8568}$) for the smallest grammar problem would be very interesting, yet not unexpected. The common belief that general constant-factor approximations probably do not exist is based on the fact that, despite substantial effort, such algorithms have not been found so far, but also on the close relation to the problem of computing shortest addition chains for a set of integers, a problem which has been extensively studied for over 100 years (see [63] for a survey on addition chains and [33, 34] for their connections to the smallest grammar problem). Formally, an addition chain is a strictly increasing sequence $(a_{1}, a_{2}, \ldots , a_{k}) \in \mathbb {N}^{k}$ with $a_{1} = 1$ such that, for every i, 2 ≤ i ≤ k, there are $b, c \in \{a_{1}, \ldots , a_{i-1}\}$ with $a_{i} = b + c$; the task is to compute a preferably short addition chain that contains a given set of integers. In a sense, grammars can be seen as the natural extension of addition chains (i. e., instead of integers, we are concerned with strings and integer-addition becomes string-concatenation).
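The analogy between addition chains and grammars can be made tangible with a small Python sketch. It checks the definition of an addition chain and, as an illustration only, turns a chain into binary rules for unary words, where the rule for $a_{i} = b + c$ concatenates the words of length b and c. The function names and the chain (1, 2, 3, 6, 12, 15) are hypothetical choices, not taken from the literature cited above.

```python
def is_addition_chain(chain):
    """Check that chain is strictly increasing, starts with 1, and that every
    later element is the sum of two (possibly equal) earlier elements."""
    if not chain or chain[0] != 1:
        return False
    for i in range(1, len(chain)):
        if chain[i] <= chain[i - 1]:
            return False
        earlier = set(chain[:i])
        if not any(chain[i] - b in earlier for b in earlier):
            return False
    return True


def chain_to_rules(chain):
    """Map every element m > 1 of a valid chain to a pair (b, c), m = b + c."""
    rules = {}
    for i in range(1, len(chain)):
        earlier = chain[:i]
        b = next(b for b in earlier if chain[i] - b in earlier)
        rules[chain[i]] = (b, chain[i] - b)
    return rules


def derive(rules, m, letter="a"):
    """Expand the rule for m into the unary word of length m."""
    if m == 1:
        return letter
    b, c = rules[m]
    return derive(rules, b, letter) + derive(rules, c, letter)
```

For the chain (1, 2, 3, 6, 12, 15), this gives five rules with two symbols each, i. e., a grammar of size 10 for the unary word of length 15.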

It has been shown in [33, 34] that a set of integers can be translated into a word (over an alphabet that grows with the number of integers), the smallest grammar of which is larger than the length of a shortest addition chain of the integers by only a constant factor. Consequently, an approximation algorithm for the smallest grammar problem with approximation ratio in $o\left (\frac {\log n}{\log \log n}\right )$ would imply an improvement of long-standing results for addition chains, for which the best known approximation algorithm achieves an approximation ratio in $\mathcal {O}(\frac {\log n}{\log \log n})$ (see [34] for details). Note that, with the results of [39] mentioned above, this statement also holds for the case of constant, even binary, alphabets.

Moreover, we can also observe that the fundamental technique of the approximation algorithms of [33, 34, 40], which links smallest grammars with the size of LZ77-factorisations, is unlikely to prove an approximation with ratio in $o\left (\frac {\log n}{\log \log n}\right )$. More precisely, these algorithms bound the size of a smallest grammar of a word from below by the length of its shortest LZ77-factorisation, and their performance is shown by comparison with this LZ77-bound. However, it is also shown (see [33, 40]) that there are words for which a smallest grammar is $\mathcal {O}(\frac {\log n}{\log \log n})$-times as large as the size of a smallest LZ77-factorisation; thus, for such algorithms, an approximation ratio better than $\mathcal {O}(\frac {\log n}{\log \log n})$ cannot be shown by this technique. Moreover, note that this result is improved in [39], where binary words are presented for which a smallest grammar is $\mathcal {O}(\frac {\log n}{\log \log n})$-times as large as the size of a smallest LZ77-factorisation.

Open Problem 2

Is there a constant-factor approximation algorithm for the smallest grammar problem? (Note that a negative result disproving a ratio of 6 or larger yields a lower bound for the restriction to binary alphabets.)

6.3 Parameterised Complexity

This work can also be seen as the starting point of a comprehensive parameterised complexity analysis of the smallest grammar problem. More precisely, our results show that the problem is most likely not in FPT, if parameterised by |Σ|, |N| or the number of levels. However, with respect to parameter |N|, we saw that it is at least in XP. A simple fixed-parameter tractable case can be obtained, if we parameterise by both |Σ| and $\ell = \max \limits \{|\mathfrak {D}(A)| \mid A \in N\}$. More precisely, for every $F \subseteq \{u \mid u \in {\Sigma }^{+}, 2 \leq |u| \leq \ell \}$, we compute a smallest F-grammar according to Lemma 11 and we output one that is minimal among them. Since the number of the sets F is bounded by a function of the parameters, this yields an fpt-algorithm. However, we consider the following parameterised variant, for which the existence of an fpt-algorithm is still open, the most interesting:
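As a rough check that the number of candidate sets F depends only on the two parameters, one can count as follows (a back-of-the-envelope bound that is not taken from the text): there are $|{\Sigma }|^{m}$ words of length m, hence

$$ \left |\{u \in {\Sigma }^{+} \mid 2 \leq |u| \leq \ell \}\right | = \sum \limits _{m=2}^{\ell } |{\Sigma }|^{m} \leq \ell \cdot |{\Sigma }|^{\ell } \quad \text {and therefore} \quad \left |\{F \mid F \subseteq \{u \in {\Sigma }^{+} \mid 2 \leq |u| \leq \ell \}\}\right | \leq 2^{\ell \cdot |{\Sigma }|^{\ell }} . $$

Under the assumption that a smallest F-grammar can be computed in time polynomial in |w| for each fixed F, this gives a total running time of the form $2^{\ell \cdot |{\Sigma }|^{\ell }} \cdot |w|^{\mathcal {O}(1)}$.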

Open Problem 3

Is the smallest grammar problem parameterised by |Σ| and |N| fixed-parameter tractable?

6.4 A More Abstract View

From a rather abstract point of view, one could generally interpret any set of factors $F \subseteq {\Sigma }^{*}$ as a grammar. More precisely, an F-grammar is then a triple $G_{F} = (N,{\Sigma }, R)$ (the axiom or start symbol is intentionally missing) with $N = \{A_{u} \mid u \in F\}$ and R is a set of rules over Σ and N that satisfies $\mathfrak {D}(A_{u}) = u$, for every u ∈ F. In this way, an F-grammar is a representation of F (except that none of the words in F is the designated compressed word). Obviously, there is a large element of freedom in this definition of F-grammars, since many choices for R are possible. However, as long as we are only interested in small grammars, this is justified, since a grammar that is smallest among all F-grammars (in the sense described above) can be computed in polynomial time. To see this, we can slightly adapt the approach from Section 4 as follows. For every u ∈ F, we first construct the subgraph with vertices $V_{4, u}$ and edges $E_{4, u}$, then we delete all vertices (u, i, j) with i < j and u[i..j]∉F (and adjacent edges). As before, it can be shown that an independent dominating set for the resulting interval graph corresponds to a smallest F-grammar. In the following, we denote by $G_{F}$ the smallest F-grammar obtained in this way.
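For illustration, the size of a smallest F-grammar can also be obtained by a direct dynamic programme, sketched below in Python. Note that this is not the interval-graph construction of Section 4 but a swapped-in, simpler technique. It relies on the assumption that the right side of the rule for $A_{u}$ may consist of terminals and of nonterminals $A_{v}$ with v ∈ F ∖{u}, so that each rule can be minimised independently of the others. The function names are hypothetical.

```python
def shortest_rule(u, factors):
    """Length of a shortest right side that derives u, using terminals and
    nonterminals A_v for v in factors with v != u."""
    best = [0] * (len(u) + 1)            # best[j]: fewest symbols deriving u[:j]
    for j in range(1, len(u) + 1):
        best[j] = best[j - 1] + 1        # last symbol is the terminal u[j - 1]
        for i in range(j - 1):           # last symbol is A_v, v = u[i:j], |v| >= 2
            v = u[i:j]
            if v != u and v in factors:
                best[j] = min(best[j], best[i] + 1)
    return best[len(u)]


def smallest_f_grammar_size(factors):
    """Sum of the shortest right sides over all u in the factor set."""
    return sum(shortest_rule(u, factors) for u in factors)
```

For example, the factor sets {aaaa} and {aaaa, aa} both yield size 4, whereas {aaaa, aa, aaa} yields size 6.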

In a sense, this abstracts away the question of how factors are compressed by other factors and boils the problem of computing small grammars down to its core of hardness, which lies in choosing the right factors. While this perspective is interesting from a theoretical point of view, it also yields questions that might have algorithmic applications. For example, as an alternative to the exponential brute-force enumeration of all $F \subseteq \mathsf {F}_{\geq 2}(w)$ in order to obtain an F-grammar that is smallest among all grammars, one could compute $G_{F}$ for a factor set F that is inclusion maximal in the sense that, for every $F^{\prime } \supsetneq F$, $|G_{F}| < |G_{F^{\prime }}|$ (or inclusion minimal, which can be defined analogously). However, this approach only seems applicable in a reasonable way, if this concept of inclusion maximality is monotone, i. e., the inclusion maximality of F is characterised by $|G_{F}| < |G_{F \cup \{u\}}|$, for every $u \in {\Sigma }^{*}$. In this regard, note that $|G_{F}| = |G_{F^{\prime }}|$ is possible for $F \subsetneq F^{\prime }$, as witnessed by $F = \{a^{4}\}$ and $F^{\prime } = \{a^{4}, a^{2}\}$.

Open Problem 4

Are there $F_{1} \subsetneq F_{2} \subsetneq F_{3} \subseteq \mathsf {F}_{\geq 2}(w)$, such that $|G_{F_{1}}| < |G_{F_{2}}|$ and $|G_{F_{3}}| < |G_{F_{1}}|$?

If the inclusion maximality is monotone, then every inclusion maximal F (thus, also an optimal F for which G_F is a smallest grammar) can be computed by starting with F = {w} and iteratively adding factors from w, until every possible new factor would increase the size of G_F. This also yields an obvious greedy strategy: always choose the new factor that results in a smallest G_F. In this regard, we stress the fact that this kind of greedy strategy differs from the algorithm Greedy [37], analysed in [33, 34], since the latter iteratively changes an existing grammar and the greediness is with respect to the rules of the intermediate grammars.
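The greedy strategy described above can be sketched in Python as follows. This is only an illustration under several assumptions: the size of $G_{F}$ is computed by a simple dynamic programme (not by the construction of Section 4), ties between candidate factors are broken lexicographically, and a factor is still added if it leaves the size unchanged. Whether such a procedure finds an optimal factor set depends on the monotonicity question raised above. All function names are hypothetical.

```python
def shortest_rule(u, factors):
    """Fewest symbols deriving u from terminals and A_v, v in factors, v != u."""
    best = [0] * (len(u) + 1)
    for j in range(1, len(u) + 1):
        best[j] = best[j - 1] + 1                # terminal u[j - 1]
        for i in range(j - 1):                   # nonterminal for u[i:j]
            v = u[i:j]
            if v != u and v in factors:
                best[j] = min(best[j], best[i] + 1)
    return best[len(u)]


def grammar_size(factors):
    """Size of a smallest F-grammar for the given factor set."""
    return sum(shortest_rule(u, factors) for u in factors)


def greedy_factor_set(w):
    """Start with F = {w} and repeatedly add the factor of w that gives the
    smallest grammar, until every remaining factor would increase the size."""
    chosen = {w}
    size = grammar_size(chosen)
    candidates = {w[i:j] for i in range(len(w))
                  for j in range(i + 2, len(w) + 1)} - chosen
    while candidates:
        best = min(sorted(candidates),
                   key=lambda u: grammar_size(chosen | {u}))
        new_size = grammar_size(chosen | {best})
        if new_size > size:                      # every new factor is worse
            break
        chosen.add(best)
        candidates.remove(best)
        size = new_size
    return chosen, size
```

On the word aaaa, the sketch first adds aa (the size stays 4) and then stops, because adding aaa would increase the size to 6.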

This also points out an interesting fact (and a potential difficulty) of this approach: the grammars corresponding to the factor sets F, $F \cup \{u\}$, $F \cup \{u, u^{\prime }\}$ and so on, i. e., the grammars $G_{F}$, $G_{F \cup \{u\}}$, etc., could be quite different and do not necessarily share the incremental character of the factor sets, in the sense that one grammar can be obtained from the previous one by small, local modifications.

Notes

1. Such context-free grammars are also called straight-line programs in the literature.
2. In this work, the term “compression” always refers to lossless data compression.
3. The work [2] also considers the compression perspective.
4. There is a Dagstuhl seminar series concerned with algorithmics on compressed sequences that so far took place in 2008 [12], 2013 [13] and 2016 [14].
5. A concept of grammar complexity has also been introduced and is investigated in the area of descriptional complexity of formal languages (see [27,28,29,30,31,32]). However, this differs from the topic of this paper, since there, grammars for finite languages are investigated and the complexity measure of interest is the number of rules (note that in [31, 32], the size of grammars is also considered).
6. Most of these algorithms were originally designed as compression algorithms (with slightly different purposes than solving the smallest grammar problem), but they can also be regarded as approximation algorithms for the smallest grammar problem and have also been investigated in this regard in [33, 34].
7. As we are not considering maximisation problems, we define the relevant terminology only for minimisation problems.
8. The report can be downloaded at http://www.informatik.uni-trier.de/~fernau/Sto77.pdf.
9. For example, a grammar can be formed into a single string by using an order on the rules and then listing the right sides with separators in between, or by listing the rules with the corresponding nonterminals.
10. See page 8 for the definition of the closed neighbourhood.
11. The restriction to grammars in Chomsky normal form is quite common, since also many of the existing approximation algorithms compute grammars in Chomsky normal form.

References

1. Nevill-Manning, C.G., Witten, I.H.: Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res. 7, 67–82 (1997)
2. Nevill-Manning, C.G.: Inferring sequential structure. Ph.D. Thesis, University of Waikato, NZ (1996)
3. de Marcken, C.: Unsupervised language acquisition. Ph.D. Thesis, Department of Electrical Engineering and Computer Science, MIT, USA (1996)
4. Gallé, M.: Searching for compact hierarchical structures in DNA by means of the smallest grammar problem. Ph.D. Thesis, University of Rennes 1, France (2011)
5. Lanctôt, J.K., Li, M., Yang, E.: Estimating DNA sequence entropy. In: Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2000, January 9-11, 2000, San Francisco, CA, USA, pp. 409–418 (2000)
6. Kieffer, J.C., Yang, E.-H.: Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000)
7. Kieffer, J.C., Yang, E.-H., Nelson, G.J., Cosman, P.C.: Universal lossless compression via multilevel pattern matching. IEEE Trans. Inf. Theory 46(4), 1227–1245 (2000)
8. Yang, E.-H., Kieffer, J.C.: Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform - part one: Without context models. IEEE Trans. Inf. Theory 46(3), 755–777 (2000)
9. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. Journal of the ACM 29(4), 928–951 (1982)
10. Storer, J.A.: NP-completeness results concerning data compression. Tech. Rep. Dept. 234, Electrical Engineering and Computer Science, Princeton University, USA (1977)
11. Li, M., Vitányi, P.: An introduction to Kolmogorov complexity and its applications, 2nd edn. Springer, Berlin (1997)
12. Böttcher, S., Lohrey, M., Maneth, S., Rytter, W.: 08261 abstracts collection - structure-based compression of complex massive data. In: Structure-Based Compression of Complex Massive Data, 22.06. - 27.06.2008. http://drops.dagstuhl.de/opus/volltexte/2008/1694/ (2008)
13. Maneth, S., Navarro, G.: Indexes and computation over compressed structured data (Dagstuhl seminar 13232). Dagstuhl Reports 3(6), 22–37 (2013). https://doi.org/10.4230/DagRep.3.6.22
14. Bille, P., Lohrey, M., Maneth, S., Navarro, G.: Computation over compressed structured data (Dagstuhl seminar 16431). Dagstuhl Reports 6(10), 99–119 (2016). https://doi.org/10.4230/DagRep.6.10.99
15. Lohrey, M.: Algorithmics on SLP-compressed strings: A survey. Groups, Complexity, Cryptology 4(2), 241–299 (2012)
16. Lohrey, M.: The compressed word problem for groups, Springer Briefs in Mathematics. Springer, Berlin (2014)
17. Akutsu, T.: A bisection algorithm for grammar-based compression of ordered trees. Inf. Process. Lett. 110(18-19), 815–820 (2010)
18. Lohrey, M., Maneth, S.: The complexity of tree automata and XPath on grammar-compressed trees. Theor. Comput. Sci. 363(2), 196–210 (2006)
19. Lohrey, M., Maneth, S., Mennicke, R.: XML tree structure compression using RePair. Inf. Syst. 38(8), 1150–1167 (2013)
20. Lohrey, M., Maneth, S., Schmidt-Schauß, M.: Parameter reduction and automata evaluation for grammar-compressed trees. J. Comput. Syst. Sci. 78(5), 1651–1669 (2012)
21. Gascón, A., Lohrey, M., Maneth, S., Reh, C.P., Sieber, K.: Grammar-based compression of unranked trees. In: Computer Science - Theory and Applications - 13th International Computer Science Symposium in Russia, CSR 2018, Moscow, Russia, June 6-10, 2018, Proceedings, pp. 118–131 (2018)
22. Gascón, A., Godoy, G., Schmidt-Schauß, M.: Unification with singleton tree grammars. In: Rewriting Techniques and Applications, 20th International Conference, RTA 2009, Brasília, Brazil, June 29 - July 1, 2009, Proceedings, pp. 365–379 (2009)
23. Berman, P., Karpinski, M., Larmore, L.L., Plandowski, W., Rytter, W.: On the complexity of pattern matching for highly compressed two-dimensional texts. J. Comput. Syst. Sci. 65(2), 332–350 (2002)
24. Plandowski, W., Rytter, W.: Application of Lempel-Ziv encodings to the solution of words equations. In: Automata, Languages and Programming, 25th International Colloquium, ICALP 1998, Aalborg, Denmark, July 13-17, 1998, Proceedings, pp. 731–742 (1998)
25. Jez, A.: Recompression: A simple and powerful technique for word equations. J. ACM 63(1), 4:1–4:51 (2016)
26. Ganardi, M., Jez, A., Lohrey, M.: Balancing straight-line programs. In: 60th Annual Symposium on Foundations of Computer Science, FOCS ’19, Baltimore, Maryland, USA, November 9-12, 2019 (2019)
27. Alspach, B., Eades, P., Rose, G.: A lower-bound for the number of productions required for a certain class of languages. Discret. Appl. Math. 6(2), 109–115 (1983)
28. Filmus, Y.: Lower bounds for context-free grammars. Inf. Process. Lett. 111, 895–898 (2011)
29. Bucher, W., Maurer, H.A., Culik II, K., Wotschke, D.: Concise description of finite languages. Theor. Comput. Sci. 14, 227–246 (1981)
30. Eberhard, S., Hetzl, S.: Compressibility of finite languages by grammars. In: Descriptional Complexity of Formal Systems - 17th International Workshop, DCFS 2015, Waterloo, ON, Canada, June 25-27, 2015. Proceedings, pp. 93–104 (2015)
31. Gruber, H., Holzer, M., Wolfsteiner, S.: On minimal grammar problems for finite languages. In: Developments in Language Theory - 22nd International Conference, DLT 2018, Tokyo, Japan, September 10-14, 2018, Proceedings, pp. 342–353 (2018)
32. Holzer, M., Wolfsteiner, S.: On the grammatical complexity of finite languages. In: Descriptional Complexity of Formal Systems - 20th IFIP WG 1.02 International Conference, DCFS 2018, Halifax, NS, Canada, July 25-27, 2018, Proceedings, pp. 151–162 (2018)
33. Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., Shelat, A.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005)
34. Lehman, E.: Approximation algorithms for grammar-based data compression. Ph.D. Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (2002)
35. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)
36. Welch, T.A.: A technique for high-performance data compression. IEEE Computer 17(6), 8–19 (1984)
37. Apostolico, A., Lonardi, S.: Off-line compression by greedy textual substitution. Proceedings of the IEEE 88, 1733–1744 (2000)
38. Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88, 1722–1732 (2000)
39. Hucke, D., Lohrey, M., Reh, C.P.: The smallest grammar problem revisited. In: String Processing and Information Retrieval - 23rd International Symposium, SPIRE 2016, Beppu, Japan, October 18-20, 2016, Proceedings, pp. 35–49 (2016)
40. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1-3), 211–222 (2003)
41. Arpe, J., Reischuk, R.: On the complexity of optimal grammar-based compression. In: 2006 Data Compression Conference (DCC 2006), 28-30 March 2006, Snowbird, UT, USA, pp. 173–182 (2006)
42. Garey, M.R., Johnson, D.S.: Computers and intractability. Freeman, New York (1979)
43. Farber, M.: Independent domination in chordal graphs. Oper. Res. Lett. 1(4), 134–138 (1982)
44. Papadimitriou, C.H.: Computational complexity. Addison-Wesley, Boston (1994)
45. Downey, R.G., Fellows, M.R.: Fundamentals of parameterized complexity. Texts in Computer Science. Springer, Berlin (2013)
46. Flum, J., Grohe, M.: Parameterized complexity theory. Springer, Berlin (2006)
47. Cygan, M., Fomin, F., Kowalik, L., Lokshtanov, D., Marx, D., Pilipczuk, M., Pilipczuk, M., Saurabh, S.: Parameterized algorithms. Springer, Berlin (2015)
48. Ausiello, G.: Complexity and approximation: combinatorial optimization problems and their approximability properties. Springer, Berlin (1999)
49. Garey, M.R., Johnson, D.S., Stockmeyer, L.: Some simplified NP-complete graph problems. Theor. Comput. Sci. 1(3), 237–267 (1976)
50. Shannon, C.E.: A theorem on coloring the lines of a network. Journal of Mathematics and Physics 28, 148–151 (1949)
51. Skulrattanakulchai, S.: Δ-list vertex coloring in linear time. Inf. Process. Lett. 98(3), 101–106 (2006)
52. Alimonti, P., Kann, V.: Some APX-completeness results for cubic graphs. Theor. Comput. Sci. 237(1-2), 123–134 (2000)
53. Nevill-Manning, C.G., Witten, I.H.: On-line and off-line heuristics for inferring hierarchies of repetitions in sequences. Proceedings of the IEEE 88, 1745–1755 (2000)
54. Benz, F., Kötzing, T.: An effective heuristic for the smallest grammar problem. In: Genetic and Evolutionary Computation Conference, GECCO ’13, Amsterdam, The Netherlands, July 6-10, 2013, pp. 487–494 (2013)
55. Carrascosa, R., Coste, F., Gallé, M., Infante-Lopez, G.: Searching for smallest grammars on large sequences and application to DNA. Journal of Discrete Algorithms 11, 62–72 (2012)
56. Fournier, J.-C.: Colorations des arêtes d’un graphe. Cahiers Centre Études Recherche Opér. 15, 311–314 (1973). Colloque sur la Théorie des Graphes (Brussels, 1973)
57. Vizing, V.G.: The chromatic class of a multigraph. Kibernetika (Kiev) 1(3), 29–39 (1965)
58. Manlove, D.F.: On the algorithmic complexity of twelve covering and independence parameters of graphs. Discret. Appl. Math. 91(1-3), 155–175 (1999)
59. Griggs, J.R., West, D.B.: Extremal values of the interval number of a graph. SIAM Journal on Matrix Analysis and Applications 1(1), 1–7 (1980)
60. Haynes, T.W., Hedetniemi, S.T., Slater, P.J.: Fundamentals of domination in graphs. Monographs and Textbooks in Pure and Applied Mathematics, vol. 208. Marcel Dekker, New York (1998)
61. Bourgeois, N., Croce, F.D., Escoffier, B., Paschos, V.T.: Fast algorithms for min independent dominating set. Discret. Appl. Math. 161(4-5), 558–572 (2013)
62. Downey, R.G., Fellows, M.R.: Fixed parameter tractability and completeness. Congressus Numerantium 87, 161–187 (1992)
63. Thurber, E.G.: Efficient generation of minimal length addition chains. SIAM Journal on Computing 28, 1247–1263 (1999)

Acknowledgments

Katrin Casel was supported by the Deutsche Forschungsgemeinschaft (FE 560/6-1). Serge Gaspers is the recipient of an Australian Research Council (ARC) Future Fellowship (FT140100048) and acknowledges support under the ARC’s Discovery Projects funding scheme (DP150101134). NICTA is funded by the Australian Government through the Department of Communications and the ARC through the ICT Centre of Excellence Program. We thank Gabriele Fici for pointing us to the “rule number” variant of the smallest grammar problem that we discussed at the end of Section 3.3.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
Katrin Casel
Fachbereich IV – Abteilung Informatikwissenschaften, Trier University, Trier, 54296, Germany
Henning Fernau
UNSW Australia, Data61 (formerly: NICTA), CSIRO, Sydney, Australia
Serge Gaspers
École Normale Superieure de Lyon, Département Informatique, Lyon, France
Benjamin Gras
Humboldt-Universität zu Berlin, Unter den Linden 6, 10099, Berlin, Germany
Markus L. Schmid

Corresponding author

Correspondence to Markus L. Schmid.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work represents an extended version of the paper “On the Complexity of Grammar-Based Compression over Fixed Alphabets” presented at the 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016) and published in LIPIcs - Leibniz International Proceedings in Informatics (https://doi.org/10.4230/LIPIcs.ICALP.2016.122).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Casel, K., Fernau, H., Gaspers, S. et al. On the Complexity of the Smallest Grammar Problem over Fixed Alphabets. Theory Comput Syst 65, 344–409 (2021). https://doi.org/10.1007/s00224-020-10013-w

Accepted: 09 October 2020
Published: 13 November 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s00224-020-10013-w
