Algorithmica

, Volume 73, Issue 1, pp 217–233 | Cite as

Fast Algorithm for Partial Covers in Words

  • Tomasz Kociumaka
  • Solon P. Pissis
  • Jakub Radoszewski
  • Wojciech Rytter
  • Tomasz Waleń
Open Access
Article

Abstract

A factor \(u\) of a word \(w\) is a cover of \(w\) if every position in \(w\) lies within some occurrence of \(u\) in \(w\). A word \(w\) covered by \(u\) thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of \(u\). In this article we introduce a new notion of \(\alpha \)-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least \(\alpha \) positions in \(w\). We develop a data structure of \(\mathcal {O}(n)\) size (where \(n=|w|\)) that can be constructed in \(\mathcal {O}(n\log n)\) time which we apply to compute all shortest \(\alpha \)-partial covers for a given \(\alpha \). We also employ it for an \(\mathcal {O}(n\log n)\)-time algorithm computing a shortest \(\alpha \)-partial cover for each \(\alpha =1,2,\ldots ,n\).

Keywords

Cover of a word Quasiperiodicity Suffix tree 

1 Introduction

The notion of periodicity in words and its many variants have been well-studied in numerous fields like combinatorics on words, pattern matching, data compression, automata theory, formal language theory, and molecular biology (see [10]). However the classic notion of periodicity is too restrictive to provide a description of a word such as abaababaaba, which is covered by copies of aba, yet not exactly periodic. To fill this gap, the idea of quasiperiodicity was introduced [1]. In a periodic word, the occurrences of the period do not overlap. In contrast, the occurrences of a quasiperiod in a quasiperiodic word may overlap. Quasiperiodicity thus enables the detection of repetitive structures that would be ignored by the classic characterization of periods.

The most well-known formalization of quasiperiodicity is the cover of word. A factor \(u\) of a word \(w\) is said to be a cover of \(w\) if \(u\ne w\), and every position in \(w\) lies within some occurrence of \(u\) in \(w\). Equivalently, we say that \(u\) covers \(w\). Note that a cover of \(w\) must also be a border—both prefix and suffix—of \(w\). Thus, in the above example, aba is the shortest cover of abaababaaba.

A linear-time algorithm for computing the shortest cover of a word was proposed by Apostolico et al. [3], and a linear-time algorithm for computing all the covers of a word was proposed by Moore and Smyth [22]. Breslauer [4] gave an online linear-time algorithm computing the minimal cover array of a word—a data structure specifying the shortest cover of every prefix of the word. Li and Smyth [21] provided a linear-time algorithm for computing the maximal cover array of a word, and showed that, analogous to the border array [9], it actually determines the structure of all the covers of every prefix of the word.

A known extension of the notion of cover is the notion of seed. A seed is not necessarily aligned with the ends of the word being covered, but is allowed to overflow on either side. More formally, a word \(u\) is a seed of \(w\) if \(u\) is a factor of \(w\) and \(w\) is a factor of some word \(y\) covered by \(u\). Seeds were first introduced by Iliopoulos et al. [17]. A linear-time algorithm for computing the shortest seed of a word was given by Kociumaka et al. [18].

Still it remains unlikely that an arbitrary word, even over the binary alphabet, has a cover (or even a seed). For example, abaaababaabaaaababaa is a word that not only has no cover, but whose every prefix also has no cover. In this article we provide a natural form of quasiperiodicity. We introduce the notion of partial covers, that is, factors covering at least a given number of positions in \(w\). Recently, Flouri et al. [13] suggested a related notion of enhanced covers which are additionally required to be borders of the word.

Partial covers can be viewed as a relaxed variant of covers alternative to approximate covers [23]. The approximate covers require each position to lie within an approximate occurrence of the cover. This allows for small irregularities within each fragment of a word. On the other hand partial covers require exact occurrences but drop the condition that all positions need to be covered. This allows some fragments to be completely irregular as long as the total length of such fragments is small. Due to the requirement of exact occurrences in partial covers they enjoy a number of combinatorial properties thanks to which they can be computed more efficiently than approximate covers, where the time complexity rarely drops below quadratic and some problems are even NP-hard.

Let \({ Covered}(u,w)\) denote the number of positions in \(w\) covered by occurrences of the word \(u\) in \(w\); we call this value the cover index of \(u\) within \(w\) (see Fig. 1). We primarily focus on the following two problems, but the tools we develop can be used to answer a number of questions concerning partial covers, some of which are discussed in the conclusions.
Fig. 1

For \(u=\) aabaaa and \(w=\) aaabaaaabaaabaaababaabaaa we have \({ Covered}(u,w)=21\)

PartialCovers problem

   Input: a word \(w\) of length \(n\) and a positive integer \(\alpha \le n\).

   Output: all shortest factors \(u\) such that \({ Covered}(u,w)\ge \alpha \).

Each factor given in the output is represented by the starting and ending position of its occurrence in \(w\).

Example 1

Let w \(=\) bcccacccaccaccb and \(\alpha =11\). Then the only shortest \(\alpha \)-partial covers are ccac and cacc.

AllPartialCovers problem

   Input: a word \(w\) of length \(n\).

   Output: for all \(\alpha =1,\ldots ,n\), a shortest factor \(u\) such that \({ Covered}(u,w)\ge \alpha \).

Our contribution. The following summarizes our main result.

Theorem 1

The PartialCovers and AllPartialCovers problems can be solved in \(\mathcal {O}(n\log n)\) time and \(\mathcal {O}(n)\) space.

We extensively use suffix trees, for an exposition see [8, 9]. A suffix tree of a word is a compact trie of its suffixes, the nodes of the trie which become nodes of the suffix tree are called explicit nodes, while the other nodes are called implicit. Each edge of the suffix tree can be viewed as an upward maximal path of implicit nodes starting with an explicit node. Moreover, each node belongs to a unique path of that kind. Then, each node of the trie can be represented in the suffix tree by the edge it belongs to and an index within the corresponding path. Each factor of the word corresponds to an explicit or implicit node of the suffix tree. A representation of this node is called the locus of the factor. Our algorithm finds the loci of the shortest partial covers, it is then straightforward to locate an occurrence for each of them.

1.1 A Sketch of the Algorithm

The algorithm first augments the suffix tree of \(w\), that is, a linear number of implicit extra nodes become explicit. Then, each node of the augmented tree is annotated with two integer values. They allow for determining the size of the covered area for each implicit node by a simple formula, since limited to a single edge of the augmented suffix tree, these values form an arithmetic progression. This yields a solution to the PartialCovers problem. For an efficient solution to the AllPartialCovers problem, we additionally find the upper envelope of a number of line segments constructed from the arithmetic progressions.

1.2 Structure of the Paper

In Sect. 2 we formally introduce the augmented and annotated suffix tree that we call Cover Suffix Tree. We show its basic properties and present its application for PartialCovers and AllPartialCovers problems. Section 4 is dedicated to the construction of the Cover Suffix Tree. Before that, Sect. 3 presents an auxiliary data structure being an extension of the classical Union/Find data structure; its implementation is given later, in Sect. 5. Additional applications of the Cover Suffix Tree are given in Sects. 6 and 7. The former presents how the data structure can be used to compute all distinct primitively rooted squares in a word and a linear-sized representation of all the seeds in a word. The latter contains a short discussion of variants of the PartialCovers problem that can be solved in a similar way.

A preliminary version of this work appeared in the Proceedings of the Twenty-Fourth Annual Symposium on Combinatorial Pattern Matching, pp. 177–188, 2013.

2 Augmented and Annotated Suffix Trees

Let \(w\) be a word of length \(n\) over a totally ordered alphabet \(\varSigma \). The suffix tree \(T\) of \(w\) can be constructed in \(\mathcal {O}(n\log {|\varSigma |})\) time [12, 24]. For an explicit or implicit node \(v\) of \(T\), we denote by \(\hat{v}\) the word obtained by spelling the characters on a path from the root to \(v\). We also denote \(|v|=|\hat{v}|\). As in most applications of the suffix tree, the leaves of \(T\) play an auxiliary role and do not correspond to factors (actually they are suffixes of \(w\#\), where \(\# \notin \varSigma \)). They are labeled with the starting positions of the suffixes of \(w\).

We introduce the Cover Suffix Tree of \(w\), denoted by \({ CST}(w)\), as an augmented—new nodes are added—suffix tree in which the nodes are annotated with information relevant to covers. \({ CST}(w)\) is similar to the data structure named Minimal Augmented Suffix Tree (see [2, 6]).

For a set \(X\) of integers and \(x\in X\), we define
$$\begin{aligned} {{ next}}_X(x)=\min \{y\in X, y>x\}, \end{aligned}$$
and we assume \({{ next}}_X(x)=\infty \) if \(x=\max X\). By \({ Occ}(v,w)\) we denote the set of starting positions of occurrences of \(\hat{v}\) in \(w\). For any \(i \in { Occ}(v,w)\), we define:
$$\begin{aligned} \delta (i,v) = {{ next}}_{{ Occ}(v,w)}(i)-i. \end{aligned}$$
Note that \(\delta (i,v)=\infty \) if \(i\) is the last occurrence of \(\hat{v}\). Additionally, we define:
$$\begin{aligned} cv(v)\,=\, Covered(\hat{v},w),\, \, \Delta (v)\,=\, \big |\left\{ i\in { Occ}(v,w)\, :\, \delta (i,v)\ge |v|\right\} \big |; \end{aligned}$$
see, for example, Fig. 2.
Fig. 2

Let \(w=\) bcccacccaccaccb and let \(v\) be the node corresponding to cacc. We have \({ Occ}(v,w)=\{4, 8, 11\}\), \({ cv}(v) = 11\), \(\varDelta (v)=2\)

A word \(u\) is called primitive if \(u=y^k\) for a word \(y\) and an integer \(k\) implies that \(y=u\), and non-primitive otherwise. A square \(u^2\) is called primitively rooted if \(u\) is primitive.

Observation 1

Let \(v\) be a node in the suffix trie of \(w\). Then \(\hat{v}\hat{v}\) is a primitively rooted square in \(w\) if and only if there exists \(i \in { Occ}(v,w)\) such that \(\delta (i,v)=|v|\).

Proof

Recall that, by the synchronization property of primitive words (see [9]), \(\hat{v}\) is primitive if and only if it occurs exactly twice in \(\hat{v}\hat{v}\).
  • \((\Rightarrow )\) If \(\hat{v}\hat{v}\) occurs in \(w\) at position \(i\) then \(\delta (i,v)=|v|\).

  • \((\Leftarrow )\) If \(\delta (i,v)=|v|\) then obviously \(\hat{v}\hat{v}\) occurs in \(w\) at position \(i\). Additionally, if \(\hat{v}\) was not primitive then \(\delta (i,v)<|v|\) would hold.

\(\square \)

In \({ CST}(w)\), we introduce additional explicit nodes called extra nodes, which correspond to halves of primitively rooted square factors of \(w\). Moreover we annotate all explicit nodes (including extra nodes) with the values \(cv,\Delta \); see, for example, Fig. 3. The number of extra nodes is bounded by the number of distinct squares, which is linear [14], so \({ CST}(w)\) takes \(\mathcal {O}(n)\) space.

Lemma 1

Let \(v_1,v_2,\ldots ,v_k\) be the consecutive implicit nodes on the edge from an explicit node \(v\) of \({ CST}(w)\) to its explicit parent. Then for \(1\le i \le k\) we have
$$\begin{aligned} cv(v_i) = cv(v)-i\Delta (v), \end{aligned}$$
in particular \(({ cv}(v_i))_{i=1}^k\) forms an arithmetic progression.

Proof

Note that \({ Occ}(v_i,w) = { Occ}(v,w)\), since otherwise \(v_i\) would be an explicit node of \({ CST}(w)\). Also note that if any two occurrences of \(\hat{v}\) in \(w\) overlap, then the corresponding occurrences of \(\hat{v_i}\) overlap. Otherwise, by Observation 1, the path from \(v\) to \(v_i\) (excluding \(v\)) would contain an extra node. Hence, when we go up from \(v\) (before reaching its parent) the size of the covered area decreases at each step by \(\varDelta (v)\). \(\square \)

Fig. 3

\({ CST}(w)\) for \(w=\) bcccacccaccaccb. It contains four extra nodes that are denoted by squares in the figure. Each node is annotated with \(cv(v), \Delta (v)\). Leaves are omitted for clarity

Example 2

Consider the word \(w\) from Fig. 3. The word cccacc corresponds to an explicit node of \({{{ CST}}}(w)\); we denote it by \(v\). We have \({ cv}(v)=10\) and \(\varDelta (v)=1\) since the two occurrences of the factor cccacc in \(w\) overlap. The word cccac corresponds to an implicit node \(v'\) and \({ cv}(v') = 10 - 1 = 9\). Now the word ccca corresponds to an extra node \(v''\) of \({{{ CST}}} (w)\). Its occurrences are adjacent in \(w\) and \({ cv}(v'')=8\), \(\varDelta (v'')=2\). The word ccc corresponds to an implicit node \(v'''\) and \({ cv}(v''') = 8 - 2 = 6\).

As a consequence of Lemma 1 we obtain the following result. Recall that the locus of a factor \(v\) of \(w\), given by its start and end position in \(w\), can be found in \(\mathcal {O}(\log \log |v|)\) time [20].

Lemma 2

Assume we are given \({ CST}(w)\). Then we can compute:
  1. (1)

    for any \(\alpha \), the loci of the shortest \(\alpha \)-partial covers in linear time;

     
  2. (2)

    given the locus of a factor \(u\) in the suffix tree \({ CST}(w)\), the cover index \({ Covered}(u,w)\) in \(\mathcal {O}(1)\) time.

     

Proof

Part (2) is a direct consequence of Lemma 1. As for part (1), for each edge of \({ CST}(w)\), leading from \(v\) to its parent \(v'\), we need to find minimum \(|v| \ge j > |v'|\) for which \({ cv}(v)-\Delta (v) \cdot (|v|-j) \ge \alpha \). Such a linear inequality can be solved in constant time. \(\square \)

Due to this fact the efficiency of the PartialCovers problem relies on the complexity of \({ CST}(w)\) construction. In turn, the following lemma, also a consequence of Lemma 1, can be used to solve AllPartialCovers problem provided that \({ CST}(w)\) is given. As a tool a solution to the geometric problem of upper envelope [16] is applied.

Lemma 3

Assume we are given \({ CST}(w)\). Then we can compute the locus of a shortest \(\alpha \)-partial cover for each \(\alpha =1,2,\ldots ,n\) in \(\mathcal {O}(n\log n)\) time and \(\mathcal {O}(n)\) space.

Proof

Consider an edge of \({ CST}(w)\) from \(v\) to its parent \(v'\) containing \(k\) implicit nodes. For each such edge, we form a line segment on the plane connecting points \((|v|,{ cv}(v))\) and \((|v|-k,{ cv}(v)-k\cdot \Delta (v))\) (if there are no implicit nodes on the edge, the line segment is a single point). Denote all such line segments obtained from \({ CST}(w)\) as \(s_1,\ldots ,s_m\); we have \(m=\mathcal {O}(n)\). We consider the upper envelope \(\mathcal {E}\) of the set of these segments. Formally, if each \(s_i\) connecting points \((x_i,y_i)\) and \((x'_i,y'_i)\), \(x_i \le x'_i\), is interpreted as a linear function on a domain \([x_i,x'_i]\), \(\mathcal {E}\) is defined as a function \(\mathcal {E}: [1,n] \rightarrow [1,n]\) such that:
$$\begin{aligned} \mathcal {E}(x) = \max \{s_i(x)\,:\, i\in \{1,\ldots ,m\},\quad x \in [x_i,x'_i]\}. \end{aligned}$$
Here we are actually interested in an integer envelope \(\mathcal {E}'\), that is, \(\mathcal {E}\) limited to integer arguments, see Fig. 4. By Lemma 1, for any \(j \in \{1,\ldots ,n\}\), \(\mathcal {E}'(j)\) equals the maximum of \({ Covered}(u,w)\) over all factors \(u\) of \(w\) such that \(|u|=j\). A piecewise linear representation of \(\mathcal {E}\) can be computed in \(\mathcal {O}(m\log {m})\) time and \(\mathcal {O}(m)\) space [16], therefore the function \(\mathcal {E}'\) for all its arguments can be computed in the same time complexity.
Fig. 4

Line segments constructed as in Lemma 3 for the \({ CST}(w)\) from Fig. 3. E.g., the line segment corresponding to the edge from cccacc to ccca is \((5,9)\)\((6,10)\). The marked points joined with a dashed polyline show the values of the integer upper envelope function \(\mathcal {E}'\). The prefix maxima sequence for \(\mathcal {E}'\) is as follows: \(\mu _1=\mu _2=\mu _3=10\), \(\mu _4=11\), \(\mu _5=\ldots =\mu _{12}=12\), \(\mu _{13}=13\), \(\mu _{14}=14\), \(\mu _{15}=15\). We infer that the lengths of the shortest \(\alpha \)-partial covers of \(w\) are: 1 for \(\alpha \le 10\), 4 for \(\alpha =11\), 5 for \(\alpha =12\), and \(\alpha \) for \(\alpha \ge 13\)

Let us introduce a prefix maxima sequence for \(\mathcal {E}'\): \(\mu _i = \max \{\mathcal {E}'(j)\,:\,j \in \{1,\ldots ,i\}\}\), with \(\mu _0=0\). Note that \(\mu _i\) is non-decreasing. If \(\mu _i>\mu _{i-1}\) then the shortest \(\alpha \)-partial cover for all \(\alpha \in (\mu _{i-1},\mu _i]\) has length \(i\). An example of such a partial cover can be recovered if we explicitly store the initial line segments used in the pieces of the representation of \(\mathcal {E}\). Thus the solution of the AllPartialCovers problem can be obtained from the sequence \(\mu _i\) in \(\mathcal {O}(m)=\mathcal {O}(n)\) time. \(\square \)

In the following two sections we provide an \(\mathcal {O}(n\log n)\) time construction of \({ CST}(w)\). Together with Lemmas 2 and 3, it yields Theorem 1.

3 Extension of Disjoint-Set Data Structure

In this section we extend the classic disjoint-set data structure to compute the change lists of the sets being merged, as defined below. First, let us extend the \({ next}\) notation. For a partition \(\mathcal {P}= \{P_1,\ldots ,P_k\}\) of \(U=\{1,\ldots ,n\}\), we define
$$\begin{aligned} {{ next}}_{\mathcal {P}}(x)={{ next}}_{P_i}(x)\quad \text {where}\quad x\in P_i. \end{aligned}$$
Now for two partitions \(\mathcal {P}, \mathcal {P}'\) let us define the change list (see also Fig. 5) by
$$\begin{aligned} { ChangeList}(\mathcal {P},\mathcal {P}')= \{(x, {{ next}}_{\mathcal {P}'}(x)): {{ next}}_{\mathcal {P}}(x)\ne {{ next}}_{\mathcal {P}'}(x)\}. \end{aligned}$$
Fig. 5

Let \(\mathcal {P}\) be the partition of \(\{1,\ldots ,9\}\) whose classes consist of leaves in the subtrees rooted at children of \(v\), \(\mathcal {P}=\{\{1,3,4\},\{2,5,6,7\},\{8,9\}\}\), and let \(\mathcal {P}'=\{\{1,\ldots ,9\}\}\). Then \({ ChangeList}(\mathcal {P},\mathcal {P}') = \{(1,2),\,(2,3),\,(4,5),\,(7,8)\}\) (depicted by dotted arrows)

We say that \((\mathcal {P},{ id})\) is a partition of \(U\) labeled by \(L\) if \(\mathcal {P}\) is a partition of \(U\) and \({ id}: \mathcal {P}\rightarrow L\) is a one-to-one (injective) mapping. A label \(\ell \in L\) is called active if \({ id}(P)=\ell \) for some \(P\in \mathcal {P}\) and free otherwise.

Lemma 4

Let \(n\le k\) be positive integers such that \(k\) is of magnitude \(\Theta (n)\). There exists a data structure of size \(\mathcal {O}(n)\), which maintains a partition \((\mathcal {P},{ id})\) of \(\{1,\ldots ,n\}\) labeled by \(L=\{1,\ldots ,k\}\) and supports the following operations:
  • \({ Find}(x)\) for \(x\in \{1,\ldots ,n\}\) gives the label of \(P\in \mathcal {P}\) containing \(x\).

  • \({ Union}(I,\ell )\) for a set \(I\) of active labels (\(|I| \ge 2\)) and a free label \(\ell \) replaces all \(P\in \mathcal {P}\) with labels in \(I\) by their set-theoretic union with the label \(\ell \). The change list of the corresponding modification of \(\mathcal {P}\) is returned.

Initially \(\mathcal {P}\) is a partition into singletons with \({ id}(\{x\})=x\). Any valid sequence of \({ Union}\) operations is performed in \(\mathcal {O}(n\log n)\) time. A single \({ Find}\) operation takes \(\mathcal {O}(1)\) time.

Note that these are actually standard disjoint-set data structure operations except for the fact that we require \({ Union}\) to return the change list. The technical proof of Lemma 4 is postponed until Sect. 5.

4 \(\mathcal {O}(n \log n)\)-Time Construction of \({ CST}(w)\)

The suffix tree of \(w\) augmented with extra nodes is called the skeleton of \({ CST}(w)\), which we denote by \({ sCST}(w)\). It could be constructed using the fact that all square factors of a word can be computed in linear time [11, 15]. However, we do not need such a complicated machinery here. We will compute \({ sCST}(w)\) on the fly, simultaneously annotating the nodes with \({ cv}\), \(\varDelta \).

We introduce auxiliary notions related to covered area of nodes:
$$\begin{aligned} { cv}_h(v)=\sum _{\begin{array}{c} i\in { Occ}(v,w)\\ \delta (i,v)<h \end{array}} \delta (i,v),\quad \Delta _h(v)=|\{i\in { Occ}(v,w)\, :\, \delta (i,v) \ge h \}|. \end{aligned}$$

Observation 2

\({ cv}(v)\,=\, { cv}_{|v|}(v)+\varDelta _{|v|}(v)\cdot |v|,\, \Delta (v)=\Delta _{|v|}(v).\)

Example 3

Let \(w=\) bcccacccaccaccb and let \(v\) be the node corresponding to cacc, as in Fig. 2; \({ Occ}(v,w)=\{4,8,11\}\). We have: \({ cv}_4(v)=3\)\(\varDelta _4(v)=\varDelta (v)=2\), \({ cv}(v)=3+2 \cdot 4=11\).

In the course of the algorithm some nodes will have their values \(c,\varDelta \) already computed; we call them processed nodes. Whenever \(v\) will be processed, so will its descendants.

The algorithm processes inner nodes \(v\) of \({ sCST}(w)\) in the order of non-increasing height \(h=|v|\). The height is not defined for leaves, so we start with \(h=n+1\). Extra nodes are created on the fly using Observation 1 (this takes place in the auxiliary \({{{ Lift}}}\) routine).

We maintain the partition \(\mathcal {P}\) of \(\{1,\ldots , n\}\) given by sets of leaves of subtrees rooted at peak nodes. Initially the peak nodes are the leaves of \({ sCST}(w)\). Each time we process \(v\), all its children are peak nodes. Consequently, after processing \(v\) they are no longer peak nodes and \(v\) becomes a new peak node. The sets in the partition are labeled with identifiers of the corresponding peak nodes. Recall that leaves are labeled with the starting positions of the corresponding suffixes. We allow any labeling of the remaining nodes as long as each node of \({ sCST}(w)\) has a distinct label of magnitude \(\mathcal {O}(n)\). For this set of labels we store the data structure of Lemma 4 to compute the change list of the changing partition.

We maintain the following technical invariant (see Fig. 6).
Fig. 6

One stage of the algorithm, where the peak nodes are \(v_1,\ldots ,v_5\) while the currently processed node is \(v\). If \(i \in { List}[d]\) and \(v_3={ Find}(i)\), then \(d=\delta (i,v_3)=Dist[i]\). The current partition is \(\mathcal {P}=\{{ Leaves}(v_1),\, { Leaves}(v_2),\) \(\, { Leaves}(v_3),\, { Leaves}(v_4), \, { Leaves}(v_5)\}\). After \(v\) is processed, the partition changes to \(\mathcal {P}=\{{ Leaves}(v_1),\, { Leaves}(v_2),\, { Leaves}(v),\, { Leaves}(v_5)\}\). The \({ Union}\) operation merges \({ Leaves}(v_4), { Leaves}(v_3)\) and returns the corresponding change list

Invariant \((h)\):

(A)   For each peak node \(z\) we store: \({ cv}'[z]={ cv}_h(z),\,\Delta '[z]=\varDelta _h(z).\)

(B)   For each \(i \in \{1,\ldots ,n\}\) we store \({ Dist}[i]=\delta (i,\,{ Find}(i))\).

(C)   For each \(d<h\) we store \({ List}[d]\,=\, \{i\,:\, { Dist}[i]=d\}\).

We use two auxiliary routines. The \({ Lift}\) operation updates \({ cv}'\) and \(\varDelta '\) values when \(h\) decrements. It also creates all extra nodes of depth \(h\). The \({ LocalCorrect}\) operation is used for updating \({ cv}'\) and \(\varDelta '\) values for children of the node \(v\). The \({ Dist}\) and \({ List}\) arrays are stored to enable efficient implementation of these two routines.

4.1 Description of the \({{{ Lift}}}(h)\) Operation

The procedure \({{{ Lift}}}\) plays an important preparatory role in processing the current node. According to part (A) of our invariant, for all peak nodes \(z\) we know the values: \(cv'[z]= cv_{h+1}(z),\,\Delta '[z]=\Delta _{h+1}(z).\) Now we have to change \(h+1\) to \(h\) and guarantee validity of the invariant: \(cv'[z]= cv_{h}(z),\,\Delta '[z]=\Delta _{h}(z).\) This is exactly how the following operation updates \({ cv}'\) and \(\varDelta '\).

It also creates all extra nodes of depth \(h\) that were not explicit nodes of the suffix tree. By Observation 1, if \(i\in {{{ List}}}[h]\) then at position \(i\) in \(w\) there is an occurrence of a primitively rooted square of half length \(h\). Consequently, an extra node corresponding to this occurrence is created in the \({{{ Lift}}}\) operation.

4.2 Description of the \({ LocalCorrect}(p,q,v)\) Operation

Here we assume that \(\hat{v}\) occurs at positions \(p<q\) and that these are consecutive occurrences. Moreover, we assume that these occurrences are followed by distinct characters, i.e. \((p,q)\in { ChangeList}(v)\). The \({ LocalCorrect}\) procedure updates \({ Dist}[p]\) to make part (B) of the invariant hold for \(p\) again. The data structure \({ List}\) is updated accordingly so that (C) remains satisfied.

4.3 Complexity of the Algorithm

In the course of the algorithm we compute \({ ChangeList}(v)\) for each \(v\in T\). We have:
$$\begin{aligned} \sum _{v \in T}\; |{ ChangeList}(v)|\; =\; \mathcal {O}(n\log n), \end{aligned}$$
since each \({ ChangeList}\) is a result of \({ Union}\) operation and the total cost of such operations, by Lemma 4, is \(\mathcal {O}(n \log n)\). Consequently we perform \(\mathcal {O}(n\log n)\) operations \({ LocalCorrect}\). In each of them at most one element is added to a list \({ List}[d]\) for some \(d\). Hence the total number of insertions to these lists is also \(\mathcal {O}(n\log n)\).

The cost of each operation \({ Lift}\) is proportional to the total size of the list \({ List}[h]\) processed in this operation. For each \(h\), the list \({ List}[h]\) is processed once and the total number of insertions into lists is \(\mathcal {O}(n\log n)\), therefore the total cost of all operations \({ Lift}\) is also \(\mathcal {O}(n\log n)\). This proves the following fact which, together with Lemmas 2 and 3, implies our main result (Theorem 1).

Lemma 5

Algorithm ComputeCST constructs \({ CST}(w)\) in \(\mathcal {O}(n\log n)\) time and \(\mathcal {O}(n)\) space, where \(n=|w|\).

5 Implementation Details

In this section we give a proof of Lemma 4. We use an approach similar to Brodal and Pedersen [5] (who use the results of [7]) originally devised for computation of maximal quasiperiodicities. Theorem 3 of [5] states that a subset \(X\) of a linearly ordered universe can be stored in a height-balanced tree of linear size supporting the following operations:
  • \(X.{ MultiInsert}(Y)\): insert all elements of \(Y\) to \(X\),

  • \(X.{ MultiPred}(Y)\): return all \((y,x)\) for \(y\in Y\) and \(x=\max \{z\in X, z<y\}\),

  • \(X.{ MultiSucc}(Y)\): return all \((y,x)\) for \(y\in Y\) and \(x=\min \{z\in X, z>y\}\),

in \(O\left( |Y|\max \left( 1, \log \frac{|X|}{|Y|}\right) \right) \) time.

Recall that our goal is to implement a sequence od \({ Find}\) and \({ Union}\) operations on a dynamic partition \((\mathcal {P},{ id})\) of \(\{1,\ldots ,n\}\) labeled by identifiers from a set \(L\). Each \({ Union}\) operation is given a list of labels of sets in the partition and is to return a change list of these sets after merge. The label of \(P \in \mathcal {P}\) is denoted as \({ id}(P)\).

In the data structure we store each \(P\in \mathcal {P}\) as a height-balanced tree. Additionally, we store several auxiliary arrays, whose semantics follows. For each \(x\in \{1,\ldots ,n\}\) we maintain a value \({ next}[x]={{ next}}_\mathcal {P}(x)\) and a pointer \({ tree}[x]\) to the tree representing \(P\) such that \(x\in P\). For each \(P\in \mathcal {P}\) (technically for each tree representing \(P\in \mathcal {P}\)) we store \({ id}[P]\) and for each \(\ell \in L\) we store \({ id}^{-1}[\ell ]\), a pointer to the corresponding tree (null for free labels).

Answering \({ Find}\) is trivial as it suffices to follow the \({ tree}\) pointer and return the \({ id}\) value. The \({ Union}\) operation is performed according to the pseudocode given below (for brevity we write \(P_{i}\) instead of \({ id}^{-1}[i]\)).

Claim

The \({{{ Union}}}\) operation correctly computes the change list and updates the data structure.

Proof

In the \({{{ Union}}}\) operation for sets \(P_i\), \(i \in I\), we find the largest set \(P_{i_0}\) and \({{{ MultiInsert}}}\) all the elements of the remaining sets to \(P_{i_0}\). If \((a,b)\) is in the change list, then \(a\) and \(b\) come from different sets \(P_i\), in particular at least one of them does not come from \(P_{i_0}\). Depending on which one it is, the pair \((a,b)\) is found by \({{{ MultiPred}}}\) or \({{{ MultiSucc}}}\) operation. While computing \(C\), the table \({ next}\) is not updated yet (i.e. corresponds to the state before \({{{ Union}}}\) operation) while \(S\) is already updated. Consequently the pairs inserted to \(C\) indeed belong to the change list. Once \(C\) is proved to be the change list, it is clear that \({ next}\) is updated correctly. For the other components of the data structure, correctness of updates is evident. \(\square \)

Claim

Any sequence of \({ Union}\) operations takes \(\mathcal {O}(n\log n)\) time in total.

Proof

Let us introduce a potential function \(\varPhi (\mathcal {P}) = \sum _{P\in \mathcal {P}} |P|\log |P|\). We shall prove that the running time of a single \({ Union}\) operation is proportional to the increase in potential. Clearly
$$\begin{aligned} 0 \le \varPhi (\mathcal {P}) = \sum _{P\in \mathcal {P}} |P|\log |P| \le \sum _{P\in \mathcal {P}} |P|\log n = n\log n, \end{aligned}$$
so this suffices to obtain the desired \(\mathcal {O}(n\log n)\) bound.
Let us consider a \({ Union}\) operation that merges partition classes of sizes \(p_1\ge p_2 \ge \cdots \ge p_k\) to a single class of size \(p = \sum _{i=1}^k p_i\). The most time-consuming steps of the algorithm are the operations on height-balanced trees, which, for single \(i\), run in \(O\left( \max \left( p_i, p_i\log \tfrac{p}{p_i}\right) \right) \) time. These operations are not performed for the largest set and for the remaining ones we have \(p_i < \frac{1}{2} p\) (i.e. \(\log {\frac{p}{p_i}}\ge 1\)). This lets us bound the time complexity of the \({ Union}\) operation as follows:
$$\begin{aligned} \sum _{i=2}^k \max \left( p_i, p_i\log \tfrac{p}{p_i}\right)&= \sum _{i=2}^k p_i\log \tfrac{p}{p_i} \le \sum _{i=1}^k p_i\log \tfrac{p}{p_i} \\&= \sum _{i=1}^k p_i(\log p - \log p_i) = p\log p - \sum _{i=1}^k p_i\log p_i, \end{aligned}$$
which is equal to the increase in potential. \(\square \)

6 By-Products of Cover Suffix Tree

In this section we present two additional applications of the Cover Suffix Tree. We show that, given \({ CST}(w)\) (or \({ CST}\) of a word that can be obtained from \(w\) in a simple manner), one can compute in linear time all distinct primitively rooted squares in \(w\) and a linear representation of all the seeds of \(w\), in particular, the shortest seeds of \(w\). This shows that constructing this data structure is at least as hard as computing all distinct primitively rooted squares and seeds. While there are linear-time algorithms for these problems [11, 15, 18, 19], they are all complex and rely on the combinatorial properties specific to the repetitive structures they seek for.

Theorem 2

Assume that the Cover Suffix Tree of a word of length \(n\) can be computed in \(T(n)\) time. Then all distinct primitively rooted squares in a word \(w\) of length \(n\) can be computed in \(T(2n)\) time.

Proof

Let \(\mathtt {0} \notin \varSigma \) be a special symbol. Let \(\varphi :\varSigma ^*\rightarrow (\varSigma \cup \{\mathtt {0}\})^*\) be a morphism such that \(\varphi (c)=\mathtt {0}c\) for any \(c \in \varSigma \). We consider the word \(w'=\varphi (w)\mathtt {0}\), that is, the word \(w\) with \(\mathtt {0}\)-characters inserted at all its inter-positions, e.g. if \(w=\mathtt {aabab}\) then \(w'=\mathtt {0a0a0b0a0b0}\).

Let us consider the set of explicit non-branching nodes of \({{{ CST}}}(w')\) and select among them the nodes corresponding to even-length factors of \(w'\) starting with the symbol \(\mathtt {0}\). It suffices to note that there is a one-to-one correspondence between these nodes and the halves of distinct primitively rooted squares in \(w\). \(\square \)

Recall that a word \(u\) is a seed of \(w\) if \(u\) is a factor of \(w\) and \(w\) is a factor of some word \(y\) covered by \(u\), see Fig. 7. The following lemma states that the set of all seeds of \(w\) has a representation of \(\mathcal {O}(n)\) size, where \(n=|w|\). This representation enables, e.g., simple computation of all shortest seeds of the word. By a range on a edge of a suffix tree we mean a number of consecutive nodes on this edge (obviously at most one of these nodes is explicit). Let \(w^R\) denote the reverse of the word \(w\).
Fig. 7

Seed of string aaabaabaabaaabaaba

Lemma 6

([17, 18]) The set of all seeds of \(w\) can be split into two disjoint classes. The seeds from one class form a single (possibly empty) range on each edge of the suffix tree of \(w\), while the seeds from the other class form a range on each edge of the suffix tree of \(w^R\).

We will show that given \({ CST}(w)\) and \({ CST}(w^R)\) we can compute the representation of all seeds from Lemma 6 in \(\mathcal {O}(n)\) time. Let us recall auxiliary notions of quasiseed and quasigap, see [18].

By \({ first}(u)\) and \({ last}(u)\) let us denote \(\min { Occ}(u)\) and \(\max { Occ}(u)\), respectively. We say that \(u\) is a complete cover in \(w\) if \(u\) is a cover of the word \(w[{ first}(u),{ last}(u)+|u|-1]\). The word \(u\) is called a quasiseed of \(w\) if \(u\) is a complete cover in \(w\), \({ first}(u) < |u|\) and \(n+1 - { last}(u) < 2|u|\). Alternatively, \(w\) can be decomposed into \(w=xyz\), where \(|x|,|z|<|u|\) and \(u\) is a cover of \(y\).

All quasiseeds of \(w\) lying on the same edge of the suffix tree with lower explicit endpoint \(v\) form a range with the lower explicit end of the range located at \(v\). The length of the upper end of the range is denoted as \(\mathsf quasigap (v)\). If the range is empty, we set \(\mathsf quasigap (v)=\infty \). Thus a representation of all quasiseeds of a given word can be provided using only the quasigaps of explicit nodes in the suffix tree. It is known that computation of quasiseeds is the hardest part of an algorithm computing seeds:

Lemma 7

([17, 18]) Assume quasigaps of all explicit nodes of suffix trees of \(w\) and \(w^R\) are known. Then a representation of all seeds of \(w\) from Lemma 6 can be found in \(\mathcal {O}(n)\) time.

It turns out that the auxiliary data in \({ CST}(w)\) and \({ CST}(w^R)\) enable constant-time computation of quasigaps of explicit nodes. By Lemma 7 this yields an \(\mathcal {O}(n)\) time algorithm for computing a representation of all the seeds of \(w\). This is stated formally in the following theorem.

Theorem 3

Assume that the Cover Suffix Tree of a word of length \(n\) can be computed in \(T(n)\) time. Given a word \(w\) of length \(n\), one can compute a representation of all seeds of \(w\) from Lemma 6 in \(T(n)\) time. In particular, all the shortest seeds of \(w\) can be computed within the same time complexity.

Proof

We show how to compute quasigaps for all explicit nodes of \({ CST}(w)\). The computation for \({ CST}(w^R)\) is symmetric. Note that \({ CST}(w)\) may contain more explicit nodes that the suffix tree of the word. In this case, the results from any maximal sequence of edges connected by non-branching explicit nodes in \({ CST}(w)\) need to be merged into a single range on the corresponding edge of the suffix tree.

By the definition of \({ cv}(v)\), an explicit node \(v\) of \({ CST}(w)\) is a complete cover in \(w\) if the following condition holds:
$$\begin{aligned} { cv}(v) = { last}(v) - { first}(v) + |v|. \end{aligned}$$
Thus for checking whether an explicit node \(v\) of \({ CST}(w)\) is a quasiseed of \(w\) it suffices to check whether this condition and the following equalities hold:
$$\begin{aligned} { first}(v) < |v|,\quad n+1 - { last}(v) < 2|v|. \end{aligned}$$
If \(v\) is not a quasiseed of \(w\), we have \(\mathsf quasigap (v)=\infty \), otherwise we can assume that \(\mathsf quasigap (v) \le |v|\).

Example 4

Consider the word \(w\) from Fig. 3, \(n=15\). The word cacc corresponds to an explicit node of \({ CST}(w)\); we denote it by \(v\). We have \({ cv}(v)=11\), \({ first}(v)=4\), \({ last}(v)=11\), and \({ last}(v) - { first}(v) + |v|=11\). Therefore cacc is a quasiseed of \(w\), see also Fig. 2.

By Lemma 1, the condition for any node on the edge ending at \(v\) to be a complete cover in \(w\) is very simple:
$$\begin{aligned} \varDelta (v) = 1. \end{aligned}$$
Assume this condition is satisfied and consider any implicit node \(v'\) on this edge, \(|v'|=k\). Then \(v'\) is a quasiseed if both inequalities:
$$\begin{aligned} { first}(v) < k\quad \text{ and }\quad n+1 - { last}(v) < 2k \end{aligned}$$
are satisfied. Thus in this case
$$\begin{aligned} \mathsf quasigap (v)=\max ({ first}(v)+1,\,\lceil (n-{ last}(v)+2)/2 \rceil ,\,| parent (v)|+1). \end{aligned}$$

Example 5

Consider the word \(w\) from Fig. 3. The word cccacc corresponds to an explicit node of \({ CST}(w)\); we denote it by \(v\). We have \({ cv}(v)=10\), \({ first}(v)=2\), \({ last}(v)=6\), and \({ last}(v) - { first}(v) + |v|=10\). Therefore cccacc is a quasiseed of \(w\). Since \(\varDelta (v)=1\), \(\mathsf quasigap (v)\) could be smaller than 6. However, \(\lceil (n-{ last}(v)+2)/2 \rceil =6\) and the above formula yields \(\mathsf quasigap (v)=6\).

This concludes a complete set of rules for computing \(\mathsf quasigap (v)\) for explicit nodes of \({ CST}(w)\). \(\square \)

7 Conclusions

We have presented an algorithm which constructs a data structure, called the Cover Suffix Tree, in \(\mathcal {O}(n\log n)\) time and \(\mathcal {O}(n)\) space. The Cover Suffix Tree has been developed in order to solve the PartialCovers and AllPartialCovers problem in \(\mathcal {O}(n)\) and \(\mathcal {O}(n\log n)\) time, respectively, but it also gives a well-structured description of the cover indices of all factors. Consequently, various questions related to partial covers can be answered efficiently. For example, with the Cover Suffix Tree one can solve in linear time a problem inverse to PartialCovers: find a factor of length between \(l\) and \(r\) that maximizes the number of positions covered. Also a similar problem to AllPartialCovers problem, to compute for all lengths \(l=1,\ldots ,n\) the maximum number of positions covered by a factor of length \(l\), can be solved in \(\mathcal {O}(n\log n)\) time. This solution was actually given implicitly in the proof of Lemma 3.

An interesting open problem is to reduce the construction time to \(\mathcal {O}(n)\). This could be difficult, though, since by the results of Sect. 6 this would yield alternative linear-time algorithms finding all distinct primitively rooted squares and computing seeds. The only known linear-time algorithms for these problems (see [11, 15, 18]) are rather complex.

Notes

Acknowledgments

Tomasz Kociumaka is supported by Polish budget funds for science in 2013–2017 as a research project under the ‘Diamond Grant’ program. Jakub Radoszewski receives financial support of Foundation for Polish Science. Tomasz Waleń is supported by Iuventus Plus Grant (IP2011 058671) of the Polish Ministry of Science and Higher Education.

References

  1. 1.
    Apostolico, A., Ehrenfeucht, A.: Efficient detection of quasiperiodicities in strings. Theor. Comput. Sci. 119(2), 247–265 (1993)MATHMathSciNetCrossRefGoogle Scholar
  2. 2.
    Apostolico, A., Preparata, F.P.: Data structures and algorithms for the string statistics problem. Algorithmica 15(5), 481–494 (1996)MATHMathSciNetCrossRefGoogle Scholar
  3. 3.
    Apostolico, A., Farach, M., Iliopoulos, C.S.: Optimal superprimitivity testing for strings. Inf. Process. Lett. 39(1), 17–20 (1991)MATHMathSciNetCrossRefGoogle Scholar
  4. 4.
    Breslauer, D.: An on-line string superprimitivity test. Inf. Process. Lett. 44(6), 345–347 (1992)MATHMathSciNetCrossRefGoogle Scholar
  5. 5.
    Brodal, G.S., Pedersen, C.N.S.: Finding maximal quasiperiodicities in strings. In: Giancarlo, R., Sankoff, D. (eds.) Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000. Lecture Notes in Computer Science, vol. 1848, pp. 397–411. Springer, Berlin (2000)Google Scholar
  6. 6.
    Brodal, G.S., Lyngsø, R.B., Östlin, A., Pedersen, C.N.S.: Solving the string statistics problem in time \({O}(n \log n)\). In: Widmayer, P., Ruiz, F.T., Bueno, R.M., Hennessy, M., Eidenbenz, S., Conejo, R. (eds.) Automata, Languages and Programming, 29th International Colloquium, ICALP 2002. Lecture Notes in Computer Science, vol. 2380, pp. 728–739. Springer, Berlin (2002)Google Scholar
  7. 7.
    Brown, M.R., Tarjan, R.E.: A fast merging algorithm. J. ACM 26(2), 211–226 (1979)MATHMathSciNetCrossRefGoogle Scholar
  8. 8.
    Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific, Singapore (2003)MATHGoogle Scholar
  9. 9.
    Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on Strings. Cambridge University Press, Cambridge (2007)MATHCrossRefGoogle Scholar
  10. 10.
    Crochemore, M., Ilie, L., Rytter, W.: Repetitions in strings: algorithms and combinatorics. Theor. Comput. Sci. 410(50), 5227–5235 (2009)MATHMathSciNetCrossRefGoogle Scholar
  11. 11.
    Crochemore, M., Iliopoulos, C.S., Kubica, M., Radoszewski, J., Rytter, W., Waleń, T.: Extracting powers and periods in a word from its runs structure. Theor. Comput. Sci. 521, 29–41 (2014)MATHCrossRefGoogle Scholar
  12. 12.
    Farach, M.: Optimal suffix tree construction with large alphabets. In: 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, pp. 137–143 (1997)Google Scholar
  13. 13.
    Flouri, T., Iliopoulos, C.S., Kociumaka, T., Pissis, S.P., Puglisi, S.J., Smyth, W., Tyczyński, W.: Enhanced string covering. Theor. Comput. Sci. 506(0), 102–114 (2013)MATHCrossRefGoogle Scholar
  14. 14.
    Fraenkel, A.S., Simpson, J.: How many squares can a string contain? J. Comb. Theory Ser. A 82(1), 112–120 (1998)MATHMathSciNetCrossRefGoogle Scholar
  15. 15.
    Gusfield, D., Stoye, J.: Linear time algorithms for finding and representing all the tandem repeats in a string. J. Comput. Syst. Sci. 69(4), 525–546 (2004)MATHMathSciNetCrossRefGoogle Scholar
  16. 16.
    Hershberger, J.: Finding the upper envelope of \(n\) line segments in \({O}(n \log n)\) time. Inf. Process. Lett. 33(4), 169–174 (1989)MATHMathSciNetCrossRefGoogle Scholar
  17. 17.
    Iliopoulos, C.S., Moore, D., Park, K.: Covering a string. Algorithmica 16(3), 288–297 (1996)MATHMathSciNetCrossRefGoogle Scholar
  18. 18.
    Kociumaka, T., Kubica, M., Radoszewski, J., Rytter, W., Waleń, T.: A linear time algorithm for seeds computation. In: Rabani, Y. (ed.) Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2012, pp. 1095–1112. SIAM (2012)Google Scholar
  19. 19.
    Kolpakov, R.M., Kucherov, G.: Finding maximal repetitions in a word in linear time. In: 40th Annual Symposium on Foundations of Computer Science, FOCS ’99, pp. 596–604. IEEE Computer Society (1999)Google Scholar
  20. 20.
    Kucherov, G., Nekrich, Y., Starikovskaya, T.A.: Cross-document pattern matching. In: Kärkkäinen, J., Stoye, J. (eds.) Combinatorial Pattern Matching—23rd Annual Symposium, CPM 2012. Lecture Notes in Computer Science, vol. 7534, pp. 196–207. Springer, Berlin (2012)Google Scholar
  21. 21.
    Li, Y., Smyth, W.F.: Computing the cover array in linear time. Algorithmica 32(1), 95–106 (2002)MATHMathSciNetCrossRefGoogle Scholar
  22. 22.
    Moore, D., Smyth, W.F.: An optimal algorithm to compute all the covers of a string. Inf. Process. Lett. 50(5), 239–246 (1994)MATHMathSciNetCrossRefGoogle Scholar
  23. 23.
    Sim, J.S., Park, K., Kim, S., Lee, J.: Finding approximate covers of strings. J. Korea Inf. Sci. Soc. 29(1), 16–21 (2002)Google Scholar
  24. 24.
    Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)MATHMathSciNetCrossRefGoogle Scholar

Copyright information

© The Author(s) 2014

Open AccessThis article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Authors and Affiliations

  • Tomasz Kociumaka
    • 1
  • Solon P. Pissis
    • 2
  • Jakub Radoszewski
    • 1
  • Wojciech Rytter
    • 1
    • 3
  • Tomasz Waleń
    • 1
  1. 1.Faculty of Mathematics, Informatics and MechanicsUniversity of WarsawWarsawPoland
  2. 2.Department of InformaticsKing’s College LondonLondon UK
  3. 3.Faculty of Mathematics and Computer ScienceCopernicus UniversityToruńPoland

Personalised recommendations